fix(ws): give runWriter exclusive ownership of provider conn writes#106
Merged
Conversation
PR #104 closed the stale-write-deadline bug on reactive PONG/Close replies, but the underlying concurrent-writer hazard remains: runWriter (relay.go) and the readProviderLoop goroutine (via gobwas's ControlHandler) both write WS frames directly to the same provider conn. Each wsutil-level frame write issues TWO underlying conn.Write calls (header, then payload). A PONG reply arriving between the header and payload of an in-flight inference_request lands its own header+payload mid-frame and corrupts the WS framing for both. The Go race detector cannot see this because net.TCPConn.Write is internally locked; the hazard lives at the WS framing boundary, one layer up. Additional racing paths: the admin-reject Close in admin_endpoints.go (direct s.close + conn.Close inside a 200ms AfterFunc) and the two ack/auth_response json.Marshal failure paths in handleV1Conn / handleV2Conn -- all post-registerProviderSession, all racing runWriter on the wire. Fix (Option 2 from the original report -- single-writer ownership): * writeCh now carries providerFrame{raw, payload}. runWriter dispatches: text -> wsutil.WriteServerText (header + payload, both from the single writer goroutine -- no inter-write window for another goroutine to squeeze in); raw -> one conn.Write of pre-baked bytes. New enqueueRaw routes reactive control replies and server-initiated Close frames through the same channel so they serialize behind any in-flight text frame instead of racing it. * readClientData(conn, controlReply) takes a reply callback. The control-frame handler captures gobwas's PONG / Close-echo header+body into a bytes.Buffer (DisableSrcCiphering: true to match the wsutil.ControlFrameHandler contract with wsutil.Reader -- otherwise the payload gets XOR'd twice). The assembled frame is handed to controlReply for delivery. * Two reply paths: handshake-phase callers (handleConn, handleV2Conn proof read) pass s.directControlReply(conn) -- writes straight to conn after re-arming the write deadline, exactly as deadlineRefreshingWriter did, because no providerSession / runWriter exists yet. The post- handshake call from readProviderLoop passes session.enqueueRaw so every reactive reply funnels through the single runWriter. * New s.closeSession(session, code, reason) for post-runWriter Close paths: enqueues a Close frame via enqueueRaw, schedules conn.Close after a 100ms drain so runWriter can actually flush. Switches the three post-runWriter direct-Close sites: admin reject, and the two json.Marshal failure paths after registerProviderSession in handleV1Conn / handleV2Conn. * deadlineRefreshingWriter removed -- its role is now the directControlReply closure for pre-handshake reads. Regression test (relay_writer_serialize_test.go) drives 200 concurrent text sends + 200 inbound client PINGs over net.Pipe, parses every wire frame with gobwas.ReadFrame, and asserts: no parse errors, expected frame counts, send-order preservation for text, exactly-once PONG echo per PING. A temporary regression shim that forced the old direct-write path failed the test 5/5 with "unexpected opcode 8 (OpClose)" parsed out of garbage bytes -- the smoking gun for framing corruption -- so the test has teeth. Verified: go test ./internal/ws/ -race -count=1 passes (11s); new test stress -race -count=30 passes; go vet ./... and go build ./... clean module-wide. TestProviderControlPongSurvivesStaleWriteDeadline (the PR #104 regression test) still passes -- directControlReply preserves its deadline-refresh semantics for the pre-handshake path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes the concurrent-writer hazard on provider WS conns that PR #104's deadline-refresh fix surfaced but didn't fully resolve.
runWriteris now the sole goroutine writing WS frames to a provider conn after handshake; reactive PONG/Close replies and server-initiated Close frames funnel throughwriteChas pre-baked raw frames instead of racing it on the wire.wsutil.WriteServerText/gobwas.WriteFrameeach issue two underlyingconn.Writecalls (header, then payload). Before this PR, runWriter and the read goroutine (via gobwas'sControlHandler.HandlePing/HandleClose) both wrote directly to the same conn. A PONG arriving between runWriter's header and payload writes interleaves at the WS framing boundary — invisible to-race(becausenet.TCPConn.Writeis internally locked) and surfaces as parser failures on the receiving side. Additional racing paths: admin-reject Close (admin_endpoints.go), and the twojson.Marshalfailure paths inhandleV1Conn/handleV2ConnafterregisterProviderSession.writeChnow carriesproviderFrame{raw, payload}; runWriter dispatches text →wsutil.WriteServerText, raw → oneconn.Writeof pre-baked bytes.readClientDatatakes acontrolReply func([]byte) error; the post-handshake path passessession.enqueueRaw, pre-handshake passesdirectControlReply(conn)(replacesdeadlineRefreshingWriterand preserves its deadline-refresh semantics —TestProviderControlPongSurvivesStaleWriteDeadlinestill passes). The control handler captures gobwas's PONG/Close-echo bytes into abytes.Buffer(withDisableSrcCiphering: true, matching the contractwsutil.ControlFrameHandlerrelies on withwsutil.Reader) and hands the assembled frame tocontrolReply. News.closeSession(session, code, reason)enqueues a Close frame + schedulesconn.Closeafter a 100ms drain; replaces the three post-runWriter direct-Close sites (admin reject; v1/v2 ack-marshal failure).relay_writer_serialize_test.go). 200 concurrent text sends + 200 inbound client PINGs overnet.Pipe. Parses every wire frame withgobwas.ReadFrame; asserts no parse errors, expected counts, send-order preservation for text, exactly-once PONG echo per PING. Has teeth: a temporary regression shim that forced the old direct-write path failed the test 5/5 withunexpected opcode 8 (OpClose)parsed out of garbage bytes (the framing-corruption signature).Test plan
go test ./internal/ws/ -race -count=1 -timeout 300s→ ok 11.0s (entire ws suite, including PR fix(ws): re-arm write deadline on reactive control replies #104'sTestProviderControlPongSurvivesStaleWriteDeadline)-race -count=30→ okgo vet ./.../go build ./...clean module-wideunexpected opcode 8(framing corruption)