Skip to content

fix(provider): drain WS receive loop so constrained providers hold ready#102

Merged
Augustas11 merged 1 commit into
mainfrom
fix/provider-ws-receive-drain
Jun 18, 2026
Merged

fix(provider): drain WS receive loop so constrained providers hold ready#102
Augustas11 merged 1 commit into
mainfrom
fix/provider-ws-receive-drain

Conversation

@Augustas11

Copy link
Copy Markdown
Owner

Root cause

A constrained provider (augustass-macbook-air, mlx-community/Llama-3.2-3B-Instruct-4bit) could not hold ready in the coordinator pool. It cycled:

connect -> coordinator "write tcp ... i/o timeout" / "provider websocket read failed"
        -> "provider websocket disconnected (grace 30000)" -> reconnect -> ...

on a cadence that tracked the provider's own keepalive period (ping at 30s -> drop ~30s; ping at 15s -> drop ~15s), so the provider never delivered a steady heartbeat. The gateway then computed no_awake_provider (phase5-gateway/internal/router/server.go:687) and buyers got 503.

Bilateral keepalive tracing (provider + coordinator journal) revealed two provider-side problems, both confirmed against the coordinator source:

  1. The keepalive WebSocket control PING was causing the disconnect. The coordinator's gobwas reader auto-writes a PONG to the raw conn when it receives a provider PING (phase4-coordinator/internal/ws/server.go:1289 readClientData / ControlFrameHandler). But the coordinator only ever sets the connection write deadline inside its runWriter text-frame path (phase4-coordinator/internal/ws/relay.go:106, write_timeout_s=10) and never clears it — it is an absolute deadline of last_text_write + 10s. Once the link has been idle past that deadline, the auto-PONG write fails immediately with write tcp ... i/o timeout, and readProviderLoop drops the session. The warmup-probe response (a text write via runWriter) refreshes the deadline, which is exactly why each connection survived ~the keepalive period and then died on the next ping.
  2. Provider control frames do not count as coordinator liveness. readProviderLoop ignores any non-text frame (server.go:1127, if op != OpText { continue }, skipping handleMessage / LastActivityAt), so a PING would not have kept the session alive even without problem (1).

This is also why bumping the coordinator's ws.write_timeout_s 10->30 (tried earlier, reverted) did not help — wrong knob; the kill is the stale-deadline auto-PONG write, not a slow legitimate write.

Note: the coordinator's stale-write-deadline behavior is a real coordinator bug, but it is out of scope for this provider-side change and is not touched here.

The fix (provider-side only; 1 source file + its test)

phase3-binary/Sources/macprovider-cli/CoordinatorClient.swift:

  1. Decouple the WS receive loop from message handling. receiveLoop ran on the CoordinatorClient actor and did socket.receive() then handle(message) serially; while handle() suspended (drain's waitUntilDrained up to drainTimeoutSeconds, warm_up's two state_update writes, token persist, or an InferenceRelay actor hop) the actor could not re-enter to call the next receive(), the OS WS read buffer backed up, and TCP backpressure stalled the coordinator's writes. Now a receive child task does only socket.receive() then continuation.yield (unbounded AsyncStreamyield never suspends and never drops control frames) and loops straight back, while one drainer child task calls handle() serially (inbound frame ordering preserved; control/heartbeat no longer blocked by inference handling, which spawns its own child Task and returns quickly). A handle() throw (CoordinatorDrainComplete, send failure) still unwinds to runReconnectLoop unchanged, via withThrowingTaskGroup.

  2. Replace the keepalive control PING with a short-interval heartbeat TEXT frame. startHeartbeat no longer sends a WS control ping; it sends a heartbeat text frame on a tick capped at 5s (keepaliveTickCeilingSeconds, well under the coordinator's 10s write deadline and any proxy idle timeout). A text frame routes through the coordinator's runWriter (which sets a fresh write deadline before writing) and reaches handleMessage (refreshing LastActivityAt). sendHeartbeat gains a resetWindow flag so the since-last metrics window is rolled only on the full coordinator interval — intermediate keepalive heartbeats report the same accumulating window — keeping heartbeat metrics unchanged from the prior one-per-interval cadence.

The now-unused sendWebSocketPing() is removed; the ProviderWebSocketTask.sendPing() protocol requirement and the URLSessionWebSocketTask extension (PR #101's resume-once guard) are left intact.

Diff: receive loop (before -> after)

Before:

private func receiveLoop(_ socket: ProviderWebSocketTask) async throws {
    while !Task.isCancelled {
        let message: URLSessionWebSocketTask.Message
        do {
            message = try await socket.receive()
        } catch {
            Self.keepaliveDebug("ws_receive_error error=...")
            throw error
        }
        try await handle(message)        // blocks the actor; next receive() cannot run
    }
}

After:

private func receiveLoop(_ socket: ProviderWebSocketTask) async throws {
    let (inbox, continuation) = AsyncStream.makeStream(
        of: URLSessionWebSocketTask.Message.self,
        bufferingPolicy: .unbounded
    )
    try await withThrowingTaskGroup(of: Void.self) { group in
        group.addTask {                              // receive: always drain the socket
            defer { continuation.finish() }
            while !Task.isCancelled {
                let message: URLSessionWebSocketTask.Message
                do { message = try await socket.receive() }
                catch { Self.keepaliveDebug("ws_receive_error ..."); throw error }
                continuation.yield(message)
            }
        }
        group.addTask { [self] in                    // drainer: serial handle(), ordering preserved
            for await message in inbox {
                try Task.checkCancellation()
                try await handle(message)
            }
        }
        do { try await group.next() } catch { group.cancelAll(); throw error }
        group.cancelAll()
    }
}

And in startHeartbeat: the per-tick sendWebSocketPing() + sendHeartbeat() (which slept the full interval first) becomes a tickSeconds = max(1, min(interval, 5)) loop that sends only sendHeartbeat(resetWindow:), rolling the window on the full interval.

Tests

  • swift build -c release --product macprovider-cli -> clean (Swift 6.3 strict concurrency, zero errors).
  • swift test -> 219 tests, 0 failures. The former testCoordinatorSessionSendsWebSocketPingBeforeHeartbeat is replaced by testCoordinatorSessionKeepaliveSendsHeartbeatTextFrameAndNoPing, asserting the keepalive emits a heartbeat text frame on the tick and no control ping (pingCount == 0).

Live validation (acceptance bar)

Patched binary deployed to ~/macprovider/macprovider-cli (backup macprovider-cli.pre-drainfix.bak); provider run with the real config (~/.config/macprovider/config.yaml).

1. 90s coordinator journal (3+ heartbeat cycles) -- NO write-timeout, holds ready:

04:48:45 ... {"provider_id":"augustass-macbook-air","state":"ready","slots_free":1,"slots_total":1,"message":"provider heartbeat"}
04:48:51 ... state:"ready" slots_free:1 ... heartbeat
04:48:56 ... state:"ready" ... heartbeat
   (steady every ~5-6s, all state:"ready", through ...)
04:50:14 ... {"provider_id":"augustass-macbook-air","state":"ready","slots_free":1,"slots_total":1,"message":"provider heartbeat"}

Zero write tcp ... i/o timeout, zero provider websocket disconnected for the whole window. (Pre-fix, those two lines appeared every ~30s.)

Provider-side keepalive trace over the same connection: one connect, heartbeat text frames every ~5s, 0 ws_ping, 0 keepalive_send_error for 100s+.

2. Gateway pool -- ready and stable (not flapping):

GET https://api.streamvc.live/v1/status
  pool: { total_providers: 1, ready: 1, degraded: 0, draining: 0, unavailable: 0 }
  models[Llama-3.2-3B-Instruct-4bit]: ready_provider_count: 1, slots_free: 1, available: true, availability: "available"

Re-checked 53s later -- still availability: "available", ready_provider_count: 1.

3. Buyer transaction through the gateway -> HTTP 200 (not 503):

POST https://api.streamvc.live/v1/chat/completions  (3B model, "say ok")
-> 200
{"id":"chatcmpl-...","model":"mlx-community/Llama-3.2-3B-Instruct-4bit","object":"chat.completion",
 "usage":{"total_tokens":40,"prompt_tokens":39,"completion_tokens":1},
 "choices":[{"finish_reason":"stop","index":0,"message":{"content":"Ok","role":"assistant"}}]}

Provider was stopped after validation (pkill -f "macprovider-cli serve", confirmed no process remains). Coordinator and Pearl were used read-only (journalctl only); no coordinator / gateway / nginx / config changes (git diff --name-only origin/main = the two provider files only).

🤖 Generated with Claude Code

A constrained provider (augustass-macbook-air, 3B model) could not hold
"ready" in the coordinator pool: it cycled connect -> coordinator
"write tcp ... i/o timeout" / "provider websocket read failed" ->
"disconnected (grace 30000)" -> reconnect, on a cadence that tracked the
provider's own keepalive period, so it never delivered a steady heartbeat
and the gateway computed no_awake_provider (buyers got 503).

Two provider-side fixes, both confirmed against the coordinator code and
validated live:

1. Decouple the WS receive loop from message handling. receiveLoop ran on
   the CoordinatorClient actor and did `await socket.receive()` then
   `await handle(message)` serially; while handle() suspended (drain's
   waitUntilDrained up to drainTimeoutSeconds, warm_up's two state_update
   writes, token persist, or an InferenceRelay hop) the actor could not
   re-enter to call the next receive(), the OS WS read buffer backed up,
   and TCP backpressure stalled the coordinator's writes. Now a receive
   child task does only receive() -> AsyncStream.yield (unbounded, so it
   never blocks or drops control frames) and loops back, while one drainer
   child calls handle() serially (frame ordering preserved). A handle()
   throw (CoordinatorDrainComplete, send failure) still unwinds to
   runReconnectLoop unchanged via withThrowingTaskGroup.

2. Replace the keepalive WebSocket control PING with a short-interval
   heartbeat TEXT frame. A provider->coordinator control PING was actively
   triggering the disconnect: the coordinator's gobwas reader auto-writes a
   PONG to the raw conn, but the coordinator only sets the connection write
   deadline inside its runWriter text path (relay.go:106, write_timeout_s=10)
   and never clears it, so once idle past that absolute 10s deadline the
   PONG write fails immediately with "write tcp ... i/o timeout" and the
   session is dropped. Provider control frames also do not count as liveness
   (readProviderLoop ignores non-text frames, server.go:1127). The keepalive
   now sends a heartbeat text frame on a tick capped at 5s (well under the
   10s write deadline); a text frame routes through runWriter (fresh write
   deadline) and refreshes LastActivityAt. The since-last metrics window is
   still rolled only on the full coordinator interval (sendHeartbeat gains a
   resetWindow flag), so heartbeat metrics are unchanged.

Provider-side only; the coordinator's stale-write-deadline defect is not
touched here. Full suite 219/219 green.

Live validation (provider augustass-macbook-air, Llama-3.2-3B-Instruct-4bit):
- 90s coordinator journal: NO write-timeout / disconnect lines; steady
  state:"ready", slots_free:1 heartbeats every ~5s.
- Gateway /v1/status: 3B availability "available", ready_provider_count 1,
  confirmed stable across a 53s re-check (not flapping).
- Buyer tx through gateway POST /v1/chat/completions -> HTTP 200 with
  content (not 503).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Augustas11 Augustas11 marked this pull request as ready for review June 18, 2026 05:10
@Augustas11 Augustas11 merged commit 591a0e2 into main Jun 18, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant