Background
CPU profiling against develop (and against #7775 + #7776 stacked, which removes the upstream log4js cost) attributes ~10% of total process CPU at the dive cliff to socket.io's per-recipient packet construction chain inside PadMessageHandler.updatePadClients:
3.36% emit [socket.io/dist/socket.js]
3.56% packet [socket.io/dist/socket.js]
3.31% _packet [socket.io/dist/client.js]
2.49% writeToEngine [socket.io/dist/client.js]
2.23% sendPacket [engine.io/build/socket.js]
The fan-out loop today does socket.emit('message', msg) once per recipient per missed revision:
// src/node/handler/PadMessageHandler.ts:950-1008
exports.updatePadClients = async (pad: PadType) => {
const roomSockets = _getRoomSockets(pad.id);
// ...
await Promise.all(roomSockets.map(async (socket) => {
while (sessioninfo.rev < pad.getHeadRevisionNumber()) {
// ... build msg ...
socket.emit('message', msg); // packet construction per recipient
sessioninfo.rev = r;
}
}));
};
In steady state every socket is at `head - 1`, so for N sockets behind by 1 rev that's N packet constructions of essentially identical content. The only per-recipient differing fields in `msg` are `timeDelta` and `currentTime` — and those fields are only consumed by the timeslider (`broadcast.ts`), not the live editor (`collab_client.ts:191-209` reads `{newRev, changeset, author, apool}` and nothing else).
The opportunity
socket.io's `room.emit()` / `io.in(roomId).emit()` uses the adapter `broadcast()` path, which calls the encoder once and then loops `writeToEngine` per recipient:
// socket.io-adapter/dist/in-memory-adapter.js:112
broadcast(packet, opts) {
const encodedPackets = this._encode(packet, packetOpts); // ONCE
// ...
socket.client.writeToEngine(encodedPackets, packetOpts); // per socket
}
Switching the steady-state path from `socket.emit` per recipient to `io.in(padId).emit` would collapse N packet constructions into 1. The realistic CPU savings at the dive cliff are 5-7% (some portion of the ~10% socket.io overhead — encoding and packet construction are shared, but the per-socket transport write still happens N times).
Why this needs design before implementing
Naively swapping `socket.emit` for `io.in(padId).emit` in the loop breaks the catch-up case:
- Steady state: all sockets at `head - 1`, one new rev → `room.emit(head)` works cleanly.
- Catch-up: one socket at `head - 5`, others at `head - 1` → a single `room.emit(head)` would be silently dropped by the lagger (`collab_client.ts:202-205` enforces `newRev === rev + 1` and `return`s on mismatch).
Two viable shapes:
Shape A — split steady-state from catch-up.
- For the most-recent rev only: `io.in(padId).except(stragglers).emit(...)` to all sockets at `head - 1`.
- For each straggler: individual `socket.emit` loop catches them up.
- Track stragglers by walking sessioninfos at fan-out entry.
Shape B — push catch-up out of `updatePadClients`.
updatePadClients only ever broadcasts the latest rev to the room.
- A client that detects it's behind (`newRev !== rev + 1`) requests resend via a new `CLIENT_REQUEST_RESEND` message, which the server handles in a separate code path.
- More invasive but cleaner long-term: keeps the hot path simple and pushes catch-up to a slow path.
I'd favour Shape A as a first cut — local change, no protocol additions, the straggler-tracking is cheap (`sessioninfo.rev !== headBefore` check per socket once per fan-out).
Acceptance
- Settings flag (
settings.roomBroadcastNewChanges default false) so it can be A/B tested in the dive.
- N=3 dive measurement before merging.
- No client-side change required for Shape A.
Out of scope
- The historicalAuthorData / pool fan-out path (covered by #7769 — closed, but the design problem is the same).
- Worker-thread offload of OT — falsified, see
experiment/worker-thread-applytotext branch.
- engine.io transport-level packing — see #7774 for the existing flush-deferral lever.
Discovered as part of the #7756 scaling dive (dive doc PR #7765).
Background
CPU profiling against develop (and against #7775 + #7776 stacked, which removes the upstream log4js cost) attributes ~10% of total process CPU at the dive cliff to socket.io's per-recipient packet construction chain inside
PadMessageHandler.updatePadClients:The fan-out loop today does
socket.emit('message', msg)once per recipient per missed revision:In steady state every socket is at `head - 1`, so for N sockets behind by 1 rev that's N packet constructions of essentially identical content. The only per-recipient differing fields in `msg` are `timeDelta` and `currentTime` — and those fields are only consumed by the timeslider (`broadcast.ts`), not the live editor (`collab_client.ts:191-209` reads `{newRev, changeset, author, apool}` and nothing else).
The opportunity
socket.io's `room.emit()` / `io.in(roomId).emit()` uses the adapter `broadcast()` path, which calls the encoder once and then loops `writeToEngine` per recipient:
Switching the steady-state path from `socket.emit` per recipient to `io.in(padId).emit` would collapse N packet constructions into 1. The realistic CPU savings at the dive cliff are 5-7% (some portion of the ~10% socket.io overhead — encoding and packet construction are shared, but the per-socket transport write still happens N times).
Why this needs design before implementing
Naively swapping `socket.emit` for `io.in(padId).emit` in the loop breaks the catch-up case:
Two viable shapes:
Shape A — split steady-state from catch-up.
Shape B — push catch-up out of `updatePadClients`.
updatePadClientsonly ever broadcasts the latest rev to the room.I'd favour Shape A as a first cut — local change, no protocol additions, the straggler-tracking is cheap (`sessioninfo.rev !== headBefore` check per socket once per fan-out).
Acceptance
settings.roomBroadcastNewChangesdefault false) so it can be A/B tested in the dive.Out of scope
experiment/worker-thread-applytotextbranch.Discovered as part of the #7756 scaling dive (dive doc PR #7765).