fix(provider): drain WS receive loop so constrained providers hold ready by Augustas11 · Pull Request #102 · Augustas11/macprovider

Augustas11 · 2026-06-18T04:57:10Z

Root cause

A constrained provider (augustass-macbook-air, mlx-community/Llama-3.2-3B-Instruct-4bit) could not hold ready in the coordinator pool. It cycled:

connect -> coordinator "write tcp ... i/o timeout" / "provider websocket read failed"
        -> "provider websocket disconnected (grace 30000)" -> reconnect -> ...

on a cadence that tracked the provider's own keepalive period (ping at 30s -> drop ~30s; ping at 15s -> drop ~15s), so the provider never delivered a steady heartbeat. The gateway then computed no_awake_provider (phase5-gateway/internal/router/server.go:687) and buyers got 503.

Bilateral keepalive tracing (provider + coordinator journal) revealed two provider-side problems, both confirmed against the coordinator source:

The keepalive WebSocket control PING was causing the disconnect. The coordinator's gobwas reader auto-writes a PONG to the raw conn when it receives a provider PING (phase4-coordinator/internal/ws/server.go:1289 readClientData / ControlFrameHandler). But the coordinator only ever sets the connection write deadline inside its runWriter text-frame path (phase4-coordinator/internal/ws/relay.go:106, write_timeout_s=10) and never clears it — it is an absolute deadline of last_text_write + 10s. Once the link has been idle past that deadline, the auto-PONG write fails immediately with write tcp ... i/o timeout, and readProviderLoop drops the session. The warmup-probe response (a text write via runWriter) refreshes the deadline, which is exactly why each connection survived ~the keepalive period and then died on the next ping.
Provider control frames do not count as coordinator liveness. readProviderLoop ignores any non-text frame (server.go:1127, if op != OpText { continue }, skipping handleMessage / LastActivityAt), so a PING would not have kept the session alive even without problem (1).

This is also why bumping the coordinator's ws.write_timeout_s 10->30 (tried earlier, reverted) did not help — wrong knob; the kill is the stale-deadline auto-PONG write, not a slow legitimate write.

Note: the coordinator's stale-write-deadline behavior is a real coordinator bug, but it is out of scope for this provider-side change and is not touched here.

The fix (provider-side only; 1 source file + its test)

phase3-binary/Sources/macprovider-cli/CoordinatorClient.swift:

Decouple the WS receive loop from message handling. receiveLoop ran on the CoordinatorClient actor and did socket.receive() then handle(message) serially; while handle() suspended (drain's waitUntilDrained up to drainTimeoutSeconds, warm_up's two state_update writes, token persist, or an InferenceRelay actor hop) the actor could not re-enter to call the next receive(), the OS WS read buffer backed up, and TCP backpressure stalled the coordinator's writes. Now a receive child task does only socket.receive() then continuation.yield (unbounded AsyncStream — yield never suspends and never drops control frames) and loops straight back, while one drainer child task calls handle() serially (inbound frame ordering preserved; control/heartbeat no longer blocked by inference handling, which spawns its own child Task and returns quickly). A handle() throw (CoordinatorDrainComplete, send failure) still unwinds to runReconnectLoop unchanged, via withThrowingTaskGroup.
Replace the keepalive control PING with a short-interval heartbeat TEXT frame. startHeartbeat no longer sends a WS control ping; it sends a heartbeat text frame on a tick capped at 5s (keepaliveTickCeilingSeconds, well under the coordinator's 10s write deadline and any proxy idle timeout). A text frame routes through the coordinator's runWriter (which sets a fresh write deadline before writing) and reaches handleMessage (refreshing LastActivityAt). sendHeartbeat gains a resetWindow flag so the since-last metrics window is rolled only on the full coordinator interval — intermediate keepalive heartbeats report the same accumulating window — keeping heartbeat metrics unchanged from the prior one-per-interval cadence.

The now-unused sendWebSocketPing() is removed; the ProviderWebSocketTask.sendPing() protocol requirement and the URLSessionWebSocketTask extension (PR #101's resume-once guard) are left intact.

Diff: receive loop (before -> after)

Before:

private func receiveLoop(_ socket: ProviderWebSocketTask) async throws {
    while !Task.isCancelled {
        let message: URLSessionWebSocketTask.Message
        do {
            message = try await socket.receive()
        } catch {
            Self.keepaliveDebug("ws_receive_error error=...")
            throw error
        }
        try await handle(message)        // blocks the actor; next receive() cannot run
    }
}

After:

private func receiveLoop(_ socket: ProviderWebSocketTask) async throws {
    let (inbox, continuation) = AsyncStream.makeStream(
        of: URLSessionWebSocketTask.Message.self,
        bufferingPolicy: .unbounded
    )
    try await withThrowingTaskGroup(of: Void.self) { group in
        group.addTask {                              // receive: always drain the socket
            defer { continuation.finish() }
            while !Task.isCancelled {
                let message: URLSessionWebSocketTask.Message
                do { message = try await socket.receive() }
                catch { Self.keepaliveDebug("ws_receive_error ..."); throw error }
                continuation.yield(message)
            }
        }
        group.addTask { [self] in                    // drainer: serial handle(), ordering preserved
            for await message in inbox {
                try Task.checkCancellation()
                try await handle(message)
            }
        }
        do { try await group.next() } catch { group.cancelAll(); throw error }
        group.cancelAll()
    }
}

And in startHeartbeat: the per-tick sendWebSocketPing() + sendHeartbeat() (which slept the full interval first) becomes a tickSeconds = max(1, min(interval, 5)) loop that sends only sendHeartbeat(resetWindow:), rolling the window on the full interval.

Tests

swift build -c release --product macprovider-cli -> clean (Swift 6.3 strict concurrency, zero errors).
swift test -> 219 tests, 0 failures. The former testCoordinatorSessionSendsWebSocketPingBeforeHeartbeat is replaced by testCoordinatorSessionKeepaliveSendsHeartbeatTextFrameAndNoPing, asserting the keepalive emits a heartbeat text frame on the tick and no control ping (pingCount == 0).

Live validation (acceptance bar)

Patched binary deployed to ~/macprovider/macprovider-cli (backup macprovider-cli.pre-drainfix.bak); provider run with the real config (~/.config/macprovider/config.yaml).

1. 90s coordinator journal (3+ heartbeat cycles) -- NO write-timeout, holds ready:

04:48:45 ... {"provider_id":"augustass-macbook-air","state":"ready","slots_free":1,"slots_total":1,"message":"provider heartbeat"}
04:48:51 ... state:"ready" slots_free:1 ... heartbeat
04:48:56 ... state:"ready" ... heartbeat
   (steady every ~5-6s, all state:"ready", through ...)
04:50:14 ... {"provider_id":"augustass-macbook-air","state":"ready","slots_free":1,"slots_total":1,"message":"provider heartbeat"}

Zero write tcp ... i/o timeout, zero provider websocket disconnected for the whole window. (Pre-fix, those two lines appeared every ~30s.)

Provider-side keepalive trace over the same connection: one connect, heartbeat text frames every ~5s, 0 ws_ping, 0 keepalive_send_error for 100s+.

2. Gateway pool -- ready and stable (not flapping):

GET https://api.streamvc.live/v1/status
  pool: { total_providers: 1, ready: 1, degraded: 0, draining: 0, unavailable: 0 }
  models[Llama-3.2-3B-Instruct-4bit]: ready_provider_count: 1, slots_free: 1, available: true, availability: "available"

Re-checked 53s later -- still availability: "available", ready_provider_count: 1.

3. Buyer transaction through the gateway -> HTTP 200 (not 503):

POST https://api.streamvc.live/v1/chat/completions  (3B model, "say ok")
-> 200
{"id":"chatcmpl-...","model":"mlx-community/Llama-3.2-3B-Instruct-4bit","object":"chat.completion",
 "usage":{"total_tokens":40,"prompt_tokens":39,"completion_tokens":1},
 "choices":[{"finish_reason":"stop","index":0,"message":{"content":"Ok","role":"assistant"}}]}

Provider was stopped after validation (pkill -f "macprovider-cli serve", confirmed no process remains). Coordinator and Pearl were used read-only (journalctl only); no coordinator / gateway / nginx / config changes (git diff --name-only origin/main = the two provider files only).

🤖 Generated with Claude Code

A constrained provider (augustass-macbook-air, 3B model) could not hold "ready" in the coordinator pool: it cycled connect -> coordinator "write tcp ... i/o timeout" / "provider websocket read failed" -> "disconnected (grace 30000)" -> reconnect, on a cadence that tracked the provider's own keepalive period, so it never delivered a steady heartbeat and the gateway computed no_awake_provider (buyers got 503). Two provider-side fixes, both confirmed against the coordinator code and validated live: 1. Decouple the WS receive loop from message handling. receiveLoop ran on the CoordinatorClient actor and did `await socket.receive()` then `await handle(message)` serially; while handle() suspended (drain's waitUntilDrained up to drainTimeoutSeconds, warm_up's two state_update writes, token persist, or an InferenceRelay hop) the actor could not re-enter to call the next receive(), the OS WS read buffer backed up, and TCP backpressure stalled the coordinator's writes. Now a receive child task does only receive() -> AsyncStream.yield (unbounded, so it never blocks or drops control frames) and loops back, while one drainer child calls handle() serially (frame ordering preserved). A handle() throw (CoordinatorDrainComplete, send failure) still unwinds to runReconnectLoop unchanged via withThrowingTaskGroup. 2. Replace the keepalive WebSocket control PING with a short-interval heartbeat TEXT frame. A provider->coordinator control PING was actively triggering the disconnect: the coordinator's gobwas reader auto-writes a PONG to the raw conn, but the coordinator only sets the connection write deadline inside its runWriter text path (relay.go:106, write_timeout_s=10) and never clears it, so once idle past that absolute 10s deadline the PONG write fails immediately with "write tcp ... i/o timeout" and the session is dropped. Provider control frames also do not count as liveness (readProviderLoop ignores non-text frames, server.go:1127). The keepalive now sends a heartbeat text frame on a tick capped at 5s (well under the 10s write deadline); a text frame routes through runWriter (fresh write deadline) and refreshes LastActivityAt. The since-last metrics window is still rolled only on the full coordinator interval (sendHeartbeat gains a resetWindow flag), so heartbeat metrics are unchanged. Provider-side only; the coordinator's stale-write-deadline defect is not touched here. Full suite 219/219 green. Live validation (provider augustass-macbook-air, Llama-3.2-3B-Instruct-4bit): - 90s coordinator journal: NO write-timeout / disconnect lines; steady state:"ready", slots_free:1 heartbeats every ~5s. - Gateway /v1/status: 3B availability "available", ready_provider_count 1, confirmed stable across a 53s re-check (not flapping). - Buyer tx through gateway POST /v1/chat/completions -> HTTP 200 with content (not 503). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Augustas11 marked this pull request as ready for review June 18, 2026 05:10

Augustas11 merged commit 591a0e2 into main Jun 18, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(provider): drain WS receive loop so constrained providers hold ready#102

fix(provider): drain WS receive loop so constrained providers hold ready#102
Augustas11 merged 1 commit into
mainfrom
fix/provider-ws-receive-drain

Augustas11 commented Jun 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Augustas11 commented Jun 18, 2026

Root cause

The fix (provider-side only; 1 source file + its test)

Diff: receive loop (before -> after)

Tests

Live validation (acceptance bar)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant