## Problem
When an SDK deployment becomes slow or congested, the Restate runtime currently has no way to detect or measure this. The problem manifests as increasingly large gaps between steps of an invocation, but the runtime only observes the symptom indirectly — through overall invocation duration or the coarse-grained inactivity timeout (default 60s).
Common causes of SDK deployment congestion include:
- Single-threaded runtimes (e.g., Node.js) where a blocking `ctx.run` callback starves the event loop, preventing other invocations sharing the same process from making progress.
- Thread pool exhaustion (e.g., Java) where all worker threads are occupied by long-running operations.
- GIL contention (Python) or goroutine scheduling delays (Go) under heavy load.
- Resource pressure — the SDK deployment is under memory/CPU pressure and can't process messages promptly.
The universal symptom across all languages/runtimes is: the SDK deployment cannot process incoming protocol messages fast enough. Messages from the runtime sit in TCP/HTTP/2 buffers and the SDK's internal queues before being dispatched.
### Why existing mechanisms are insufficient
- Inactivity timeout (default 60s): This is a kill switch, not a diagnostic tool. It fires when no messages are exchanged at all, but a congested SDK might still be sending messages — just slowly. It also can't distinguish between "the SDK is blocked" and "the SDK is legitimately idle waiting for a completion."
- HTTP/2 PING frames: These operate at the transport layer. Most HTTP/2 implementations (including hyper) handle PINGs in the I/O layer, potentially on a separate task. A transport-level PING will succeed even if the SDK's application-level message processing is completely blocked. Additionally, gateways/proxies typically don't propagate HTTP/2 PINGs between downstream and upstream.
- Passive observation (measuring inter-message timing on the runtime side): This depends on the SDK sending messages. If the SDK is blocked or slow, it won't send messages, so there's nothing to measure. You can only detect the problem after it has already manifested as a gap.
- Health check endpoints / side-channel probes (e.g., `GET /health` on a separate connection): These use a different TCP connection than the invocation streams. A separate connection doesn't experience the same queuing behavior as messages multiplexed on the shared HTTP/2 connection that carries invocation traffic. A health endpoint might respond promptly while invocation streams are backed up.
## What we need
A mechanism that:
- Is application-level: The measurement must go through the SDK's normal message processing/dispatch pipeline, not just the transport layer. This is what ensures we're measuring the actual queuing and processing delays that affect invocations.
- Is demand-driven: The SDK should only do extra work when the runtime asks for it. No overhead when probing is not needed.
- Is generic across all SDK languages/runtimes: The solution must not depend on language-specific facilities (e.g., Node.js event loop monitoring). It must work identically for TypeScript, Java, Python, Go, Rust, etc.
- Travels the same path as invocation traffic: The probe must share the same HTTP/2 connection and go through the same SDK message dispatch path as real invocation messages, so it experiences the same queuing behavior.
- Allows the runtime flexibility: The runtime should be free to choose which streams to probe, how often, and what to do with the results (metrics, logging, alerts, circuit breaking, etc.). The protocol should not prescribe policy.
## Alternative: User-side SDK instrumentation
As a lighter-weight alternative that requires no protocol changes, users can instrument their SDK deployments directly to detect congestion on their side:
- Node.js: Use `perf_hooks.monitorEventLoopDelay()` to track event loop lag and export it as a metric.
- Java: Monitor thread pool utilization (e.g., active threads vs. pool size) and request queue depth.
- Python: Measure GIL contention or asyncio loop lag.
- Go: Monitor goroutine counts and scheduling latency via `runtime` package metrics.
Users would then set up alerts on these SDK-side metrics to detect when their deployment is becoming congested.
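For the asyncio case above, loop lag can be approximated without any special APIs: request a short sleep and measure how late the loop actually wakes you up. The sketch below is illustrative (the helper names and thresholds are not from any Restate SDK); a blocking call stands in for a synchronous handler hogging the loop.

```python
import asyncio
import time

async def monitor_loop_lag(interval: float, samples: list[float], rounds: int) -> None:
    """Periodically measure how late the event loop wakes us up.

    The difference between the requested sleep and the actual elapsed time
    approximates scheduling lag caused by blocking work on the loop.
    """
    for _ in range(rounds):
        start = time.monotonic()
        await asyncio.sleep(interval)
        lag = time.monotonic() - start - interval
        samples.append(max(lag, 0.0))

async def main() -> list[float]:
    samples: list[float] = []
    monitor = asyncio.create_task(monitor_loop_lag(0.01, samples, 5))
    await asyncio.sleep(0.005)  # let the monitor start its first sleep
    time.sleep(0.05)            # blocking call: the monitor cannot run during this
    await monitor
    return samples

samples = asyncio.run(main())
```

In a real deployment the samples would feed a histogram metric rather than a list, with alerts on the upper percentiles.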
Trade-offs: This approach works today with zero runtime/protocol changes and gives language-specific detail that a generic mechanism cannot. However, it places the burden entirely on users to instrument, monitor, and correlate these metrics. It also doesn't give the Restate runtime any visibility into SDK health — the runtime cannot adapt its behavior (e.g., adjust concurrency, log warnings, or surface health in the admin API) based on information it never receives.
The two approaches are complementary: user-side instrumentation provides rich, language-specific diagnostics, while protocol-level probing gives the runtime direct visibility for operational purposes.
## Proposed solution: Protocol-level Ping/Pong
Add two new control messages to the service protocol:
- `PingMessage` (runtime → SDK): Contains a `nonce` (opaque identifier).
- `PongMessage` (SDK → runtime): Echoes the `nonce`.
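In Python terms, the message pair is tiny — the shapes below are illustrative, not the actual service-protocol schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PingMessage:
    nonce: bytes  # opaque identifier chosen by the runtime

@dataclass(frozen=True)
class PongMessage:
    nonce: bytes  # echoes the ping's nonce unchanged

def pong_for(ping: PingMessage) -> PongMessage:
    # The SDK's only obligation: echo the nonce back.
    return PongMessage(nonce=ping.nonce)
```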
### Protocol behavior
- The runtime sends a `PingMessage` on an active invocation's HTTP/2 stream during the bidirectional streaming phase.
- The SDK receives the `PingMessage` through its normal message dispatch loop (the same code path that processes completions, acks, and other runtime messages). The SDK responds with a `PongMessage` containing the same `nonce`.
- The runtime measures the time between sending the `PingMessage` and receiving the `PongMessage` — this is the application-level round-trip time.
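The runtime-side measurement can be sketched as follows. This is a hypothetical illustration, not invoker code: `send` and `responses` stand in for the stream's write half and incoming pong nonces, and the nonce check discards stale pongs from earlier probes.

```python
import asyncio
import os
import time

async def measure_ping_rtt(send, responses: asyncio.Queue, timeout: float) -> float:
    """Send a ping on the invocation stream and await the matching pong.

    `send` writes to the stream; `responses` yields pong nonces arriving
    on the same stream. Returns the application-level RTT in seconds.
    """
    nonce = os.urandom(8)
    start = time.monotonic()
    await send(nonce)
    while True:
        echoed = await asyncio.wait_for(responses.get(), timeout)
        if echoed == nonce:  # ignore pongs from earlier pings
            return time.monotonic() - start

async def demo() -> float:
    responses: asyncio.Queue = asyncio.Queue()

    async def send(nonce: bytes) -> None:
        async def echo() -> None:
            await asyncio.sleep(0.02)  # simulated SDK dispatch delay
            await responses.put(nonce)
        asyncio.ensure_future(echo())

    return await measure_ping_rtt(send, responses, timeout=1.0)

rtt = asyncio.run(demo())
```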
### SDK requirements
- The SDK MUST process `PingMessage` in its normal message dispatch loop, NOT in a separate I/O thread or at the transport layer. This ensures the round-trip measures the full processing pipeline latency, including any queuing.
- The SDK MUST respond with a `PongMessage` echoing the same `nonce`.
- The SDK implementation is trivial: a few lines in the message dispatcher.
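A minimal sketch of such a dispatcher, under the same illustrative message shapes as above (not real SDK code): the ping is handled inline, in arrival order, so messages queued ahead of it delay the pong and the measurement reflects real dispatch latency.

```python
import asyncio
from dataclasses import dataclass

@dataclass(frozen=True)
class PingMessage:
    nonce: bytes

@dataclass(frozen=True)
class PongMessage:
    nonce: bytes

async def dispatch_loop(inbox: asyncio.Queue, outbox: asyncio.Queue) -> None:
    """The SDK's single message dispatch loop.

    Pings wait their turn behind any messages queued ahead of them,
    so the pong reflects the pipeline's actual queuing delay.
    """
    while True:
        msg = await inbox.get()
        if msg is None:  # stream closed
            return
        if isinstance(msg, PingMessage):
            await outbox.put(PongMessage(nonce=msg.nonce))
        else:
            pass  # completions, acks, and other runtime messages handled here

async def demo() -> PongMessage:
    inbox: asyncio.Queue = asyncio.Queue()
    outbox: asyncio.Queue = asyncio.Queue()
    task = asyncio.create_task(dispatch_loop(inbox, outbox))
    await inbox.put(PingMessage(nonce=b"\x01\x02"))
    await inbox.put(None)
    await task
    return await outbox.get()

pong = asyncio.run(demo())
```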
### Runtime behavior (not prescribed by the protocol)
The protocol defines the message format and the SDK's obligation to respond. Everything else is runtime policy:
- Which streams to probe: The runtime could ping one stream per deployment, all streams, or a subset. Since all streams to the same deployment typically share the same SDK process, one probe per deployment is often sufficient.
- How often to probe: Configurable interval. Could be adaptive — probe more frequently when congestion is suspected.
- What to do with the results: Emit metrics (e.g., a `restate.invoker.ping_rtt.seconds` histogram), log warnings when RTT exceeds thresholds, or eventually feed into deployment selection / circuit breaking.
- When to start probing: Could be always-on (with a long interval), or triggered by observed anomalies (large step gaps).
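One possible shape for such a policy, shown purely as a sketch (names and thresholds are invented, not runtime configuration): probe slowly in steady state, faster once an RTT crosses a warning threshold.

```python
from dataclasses import dataclass

@dataclass
class ProbePolicy:
    """Hypothetical adaptive probing policy, entirely runtime-side."""
    base_interval: float = 30.0       # seconds between probes when healthy
    congested_interval: float = 5.0   # probe faster while congestion is suspected
    warn_threshold: float = 0.25      # RTT (seconds) above which we suspect congestion

    def next_interval(self, last_rtt: float) -> float:
        # Tighten the probe interval while the deployment looks congested.
        if last_rtt > self.warn_threshold:
            return self.congested_interval
        return self.base_interval

policy = ProbePolicy()
```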
### Protocol version
This requires a new protocol version. SDKs that don't support the new version simply never receive pings — the feature is fully backward compatible via version negotiation.
## Why not other approaches
| Approach | Problem |
|---|---|
| SDK self-reports event loop metrics in existing messages | Depends on the SDK sending messages (doesn't work when SDK is blocked). Also language-specific — not all runtimes have a "poll time" concept. |
| Separate health check endpoint | Different TCP connection, doesn't share queuing with invocation traffic. |
| Runtime passively measures inter-message gaps | Only detects the problem after it manifests. Can't distinguish slow SDK from slow network from legitimate idle time. |
| HTTP/2 PING frames | Transport-level, handled by I/O layer, doesn't measure application processing. Blocked by proxies/gateways. |
| Adaptive inactivity timeout | Reactive, not proactive. Doesn't provide continuous measurement. |