fix(provider): drain WS receive loop so constrained providers hold ready#102
Merged
A constrained provider (augustass-macbook-air, 3B model) could not hold "ready" in the coordinator pool: it cycled connect -> coordinator "write tcp ... i/o timeout" / "provider websocket read failed" -> "disconnected (grace 30000)" -> reconnect, on a cadence that tracked the provider's own keepalive period, so it never delivered a steady heartbeat and the gateway computed no_awake_provider (buyers got 503).

Two provider-side fixes, both confirmed against the coordinator code and validated live:

1. Decouple the WS receive loop from message handling. receiveLoop ran on the CoordinatorClient actor and did `await socket.receive()` then `await handle(message)` serially; while handle() suspended (drain's waitUntilDrained up to drainTimeoutSeconds, warm_up's two state_update writes, token persist, or an InferenceRelay hop) the actor could not re-enter to call the next receive(), the OS WS read buffer backed up, and TCP backpressure stalled the coordinator's writes. Now a receive child task does only receive() -> AsyncStream.yield (unbounded, so it never blocks or drops control frames) and loops back, while one drainer child calls handle() serially (frame ordering preserved). A handle() throw (CoordinatorDrainComplete, send failure) still unwinds to runReconnectLoop unchanged via withThrowingTaskGroup.

2. Replace the keepalive WebSocket control PING with a short-interval heartbeat TEXT frame. A provider->coordinator control PING was actively triggering the disconnect: the coordinator's gobwas reader auto-writes a PONG to the raw conn, but the coordinator only sets the connection write deadline inside its runWriter text path (relay.go:106, write_timeout_s=10) and never clears it, so once idle past that absolute 10s deadline the PONG write fails immediately with "write tcp ... i/o timeout" and the session is dropped. Provider control frames also do not count as liveness (readProviderLoop ignores non-text frames, server.go:1127). The keepalive now sends a heartbeat text frame on a tick capped at 5s (well under the 10s write deadline); a text frame routes through runWriter (fresh write deadline) and refreshes LastActivityAt. The since-last metrics window is still rolled only on the full coordinator interval (sendHeartbeat gains a resetWindow flag), so heartbeat metrics are unchanged.

Provider-side only; the coordinator's stale-write-deadline defect is not touched here. Full suite 219/219 green.

Live validation (provider augustass-macbook-air, Llama-3.2-3B-Instruct-4bit):
- 90s coordinator journal: NO write-timeout / disconnect lines; steady state:"ready", slots_free:1 heartbeats every ~5s.
- Gateway /v1/status: 3B availability "available", ready_provider_count 1, confirmed stable across a 53s re-check (not flapping).
- Buyer tx through gateway POST /v1/chat/completions -> HTTP 200 with content (not 503).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
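The keepalive cadence described in the commit message (a tick capped at 5s, with the metrics window rolled only once per full coordinator interval) can be modelled in a few lines. This is a minimal sketch, not the code in `CoordinatorClient.swift`: the `KeepaliveSchedule` type and its `nextResetWindow()` method are hypothetical names, and the rule "roll on the first tick at or after the full interval" is an assumption about how `resetWindow` is derived.

```swift
/// Hypothetical model of the keepalive cadence; names are illustrative only.
struct KeepaliveSchedule {
    let interval: Int   // full coordinator heartbeat interval, in seconds
    let tick: Int       // actual send period, in seconds
    private var sinceRoll = 0

    init(interval: Int, ceiling: Int = 5) {
        self.interval = interval
        // Same clamp as the tickSeconds expression quoted in the PR:
        // at least 1s, at most the ceiling.
        self.tick = max(1, min(interval, ceiling))
    }

    /// Call once per tick; returns the resetWindow value for that heartbeat.
    /// Assumption: the window rolls on the first tick at or after `interval`.
    mutating func nextResetWindow() -> Bool {
        sinceRoll += tick
        guard sinceRoll >= interval else { return false }
        sinceRoll = 0
        return true
    }
}

var schedule = KeepaliveSchedule(interval: 30)
let flags = (0..<6).map { _ in schedule.nextResetWindow() }
print(schedule.tick, flags)
// prints "5 [false, false, false, false, false, true]"
```

With a 30s interval the sketch sends six heartbeats per interval and only the sixth carries `resetWindow == true`, which matches the stated goal of keeping the reported window equivalent to the earlier one-per-interval cadence.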
Root cause

A constrained provider (`augustass-macbook-air`, `mlx-community/Llama-3.2-3B-Instruct-4bit`) could not hold ready in the coordinator pool. It cycled connect -> coordinator "write tcp ... i/o timeout" / "provider websocket read failed" -> "disconnected (grace 30000)" -> reconnect, on a cadence that tracked the provider's own keepalive period (ping at 30s -> drop ~30s; ping at 15s -> drop ~15s), so the provider never delivered a steady heartbeat. The gateway then computed `no_awake_provider` (`phase5-gateway/internal/router/server.go:687`) and buyers got 503.

Bilateral keepalive tracing (provider + coordinator journal) revealed two provider-side problems, both confirmed against the coordinator source:

1. A provider control PING actively triggers the disconnect. The coordinator's gobwas reader auto-writes a PONG to the raw `conn` when it receives a provider PING (`phase4-coordinator/internal/ws/server.go:1289`, `readClientData`/`ControlFrameHandler`). But the coordinator only ever sets the connection write deadline inside its `runWriter` text-frame path (`phase4-coordinator/internal/ws/relay.go:106`, `write_timeout_s=10`) and never clears it; it is an absolute deadline of `last_text_write + 10s`. Once the link has been idle past that deadline, the auto-PONG write fails immediately with `write tcp ... i/o timeout`, and `readProviderLoop` drops the session. The warmup-probe response (a text write via `runWriter`) refreshes the deadline, which is exactly why each connection survived for roughly the keepalive period and then died on the next ping.
2. Provider control frames do not count as liveness. `readProviderLoop` ignores any non-text frame (`server.go:1127`, `if op != OpText { continue }`, skipping `handleMessage`/`LastActivityAt`), so a PING would not have kept the session alive even without problem (1).

This is also why bumping the coordinator's `ws.write_timeout_s` 10->30 (tried earlier, reverted) did not help: it was the wrong knob. The kill is the stale-deadline auto-PONG write, not a slow legitimate write.

The fix (provider-side only; 1 source file + its test)
`phase3-binary/Sources/macprovider-cli/CoordinatorClient.swift`:

1. Decouple the WS receive loop from message handling. `receiveLoop` ran on the `CoordinatorClient` actor and did `socket.receive()` then `handle(message)` serially; while `handle()` suspended (drain's `waitUntilDrained` up to `drainTimeoutSeconds`, `warm_up`'s two `state_update` writes, token persist, or an `InferenceRelay` actor hop) the actor could not re-enter to call the next `receive()`, the OS WS read buffer backed up, and TCP backpressure stalled the coordinator's writes. Now a receive child task does only `socket.receive()` then `continuation.yield` (unbounded `AsyncStream`; `yield` never suspends and never drops control frames) and loops straight back, while one drainer child task calls `handle()` serially (inbound frame ordering preserved; control/heartbeat no longer blocked by inference handling, which spawns its own child `Task` and returns quickly). A `handle()` throw (`CoordinatorDrainComplete`, send failure) still unwinds to `runReconnectLoop` unchanged, via `withThrowingTaskGroup`.
2. Replace the keepalive control PING with a short-interval heartbeat TEXT frame. `startHeartbeat` no longer sends a WS control ping; it sends a heartbeat text frame on a tick capped at 5s (`keepaliveTickCeilingSeconds`, well under the coordinator's 10s write deadline and any proxy idle timeout). A text frame routes through the coordinator's `runWriter` (which sets a fresh write deadline before writing) and reaches `handleMessage` (refreshing `LastActivityAt`). `sendHeartbeat` gains a `resetWindow` flag so the since-last metrics window is rolled only on the full coordinator interval (intermediate keepalive heartbeats report the same accumulating window), keeping heartbeat metrics unchanged from the prior one-per-interval cadence.

The now-unused `sendWebSocketPing()` is removed; the `ProviderWebSocketTask.sendPing()` protocol requirement and the `URLSessionWebSocketTask` extension (PR #101's resume-once guard) are left intact.

Diff: receive loop (before -> after)
Before:
After:
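As an illustration of the decoupled shape described in the fix, and explicitly not the PR's actual diff, here is a minimal sketch. The `FrameSource` protocol, the `String` frame type, and the `handle` closure parameter are hypothetical stand-ins for the provider's real WebSocket wrapper and message handler. The sketch also assumes that `receive()` throws promptly when its task is cancelled; without that, teardown would wait on the pending read.

```swift
/// Hypothetical stand-in for the provider's WebSocket wrapper.
protocol FrameSource: Sendable {
    func receive() async throws -> String
}

/// Runs until receive() or handle() throws; the error propagates to the
/// caller (in the PR, that caller is runReconnectLoop).
func receiveLoop(
    socket: any FrameSource,
    handle: @escaping @Sendable (String) async throws -> Void
) async throws {
    // Unbounded buffering: yield never suspends and never drops a frame.
    let (frames, continuation) = AsyncStream.makeStream(
        of: String.self, bufferingPolicy: .unbounded)

    try await withThrowingTaskGroup(of: Void.self) { group in
        // Receiver: only reads and enqueues, then loops straight back,
        // so a slow handler cannot stall the socket read.
        group.addTask {
            defer { continuation.finish() }
            while !Task.isCancelled {
                let frame = try await socket.receive()
                continuation.yield(frame)
            }
        }
        // Drainer: handles frames one at a time, in arrival order.
        group.addTask {
            for await frame in frames {
                try await handle(frame)
            }
        }
        do {
            // Rethrows the first child error, whichever child raised it.
            for try await _ in group {}
        } catch {
            group.cancelAll()
            throw error
        }
    }
}
```

Using a single drainer keeps frame ordering identical to the old serial loop; only the read side gains independence. An unbounded buffer trades memory for the guarantee that control frames are never dropped, which is reasonable only if inbound traffic is small relative to handling speed.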
And in `startHeartbeat`: the per-tick `sendWebSocketPing()` + `sendHeartbeat()` (which slept the full interval first) becomes a `tickSeconds = max(1, min(interval, 5))` loop that sends only `sendHeartbeat(resetWindow:)`, rolling the window on the full interval.

Tests

`swift build -c release --product macprovider-cli` -> clean (Swift 6.3 strict concurrency, zero errors).
`swift test` -> 219 tests, 0 failures. The former `testCoordinatorSessionSendsWebSocketPingBeforeHeartbeat` is replaced by `testCoordinatorSessionKeepaliveSendsHeartbeatTextFrameAndNoPing`, asserting the keepalive emits a heartbeat text frame on the tick and no control ping (`pingCount == 0`).

Live validation (acceptance bar)
Patched binary deployed to `~/macprovider/macprovider-cli` (backup `macprovider-cli.pre-drainfix.bak`); provider run with the real config (`~/.config/macprovider/config.yaml`).

1. 90s coordinator journal (3+ heartbeat cycles) -- NO write-timeout, holds ready:

Zero `write tcp ... i/o timeout`, zero `provider websocket disconnected` for the whole window. (Pre-fix, those two lines appeared every ~30s.)

Provider-side keepalive trace over the same connection: one connect, heartbeat text frames every ~5s, 0 `ws_ping`, 0 `keepalive_send_error` for 100s+.

2. Gateway pool -- ready and stable (not flapping):

Re-checked 53s later -- still `availability: "available"`, `ready_provider_count: 1`.

3. Buyer transaction through the gateway -> HTTP 200 (not 503):

Provider was stopped after validation (`pkill -f "macprovider-cli serve"`, confirmed no process remains). Coordinator and Pearl were used read-only (`journalctl` only); no coordinator / gateway / nginx / config changes (`git diff --name-only origin/main` = the two provider files only).

🤖 Generated with Claude Code