From 2645da23e47e1fd45ce37f304e01015cdd9f10e2 Mon Sep 17 00:00:00 2001
From: John McLear <john@mclear.co.uk>
Date: Fri, 15 May 2026 20:24:51 +0100
Subject: [PATCH 01/15] =?UTF-8?q?docs:=20scaling=20dive=202026-05=20?=
 =?UTF-8?q?=E2=80=94=20first=20numbers-backed=20answer=20to=20#7756?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Phase 2 deliverable from the scaling-dive program. Documents:

- Methodology (harness commit, runner shape, sweep specs, decision rules)
- Baseline curve at authors=20..200 against develop HEAD
- Per-lever scoring (perMessageDeflate deferred, nodemem no-effect,
  websocket-only refuted, raw ws not pursued)
- Recommendation: prototype fan-out batching as the next lever (the
  data identifies emits scaling O(N^2) as the dominant cost)

Closes Phase 2 of #7756. Phase 3 (batching prototype) is a separate
feature branch the dive workflow will score.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/scaling-dive-2026-05.md | 133 +++++++++++++++++++++++++++++++++++
 1 file changed, 133 insertions(+)
 create mode 100644 docs/scaling-dive-2026-05.md

diff --git a/docs/scaling-dive-2026-05.md b/docs/scaling-dive-2026-05.md
new file mode 100644
index 00000000000..ce2468ade4d
--- /dev/null
+++ b/docs/scaling-dive-2026-05.md
@@ -0,0 +1,133 @@
+# Scaling dive — 2026-05
+
+**Closes Phase 2 of #7756.** First numbers-backed answer to "how many editors can be on one pad, and what is the bottleneck when it falls over?"
+
+## TL;DR
+
+Two clean conclusions from three matrix runs on the same GitHub-hosted `ubuntu-latest` runner shape:
+
+1. **Server-side changeset apply is not the bottleneck.** Even at 200 concurrent authors, `etherpad_changeset_apply_duration_seconds` mean is ~3.7–4.4 ms — well under client-perceived p95 (~20–25 ms). The remaining latency lives in *fan-out*, not in *apply*.
+2. **Dropping the socket.io polling fallback (`socketTransportProtocols: ["websocket"]`) makes things worse, not better, under high concurrency.** At 200 authors it nearly doubles client p95 (37 ms vs 20 ms baseline). The hypothesis that the polling fallback was costing us is falsified.
+
+Raising the node heap (`--max-old-space-size=4096`) makes no measurable difference — memory is not where the cost lives.
+
+Next step: prototype the **fan-out batching** lever (spec section 9 lever 3). Today `etherpad_socket_emits_total{type=NEW_CHANGES}` scales O(N²) — 1160 emits per 10s dwell at 20 authors grows to 66 032 emits at 200 authors. Coalescing N changesets within a configurable window before broadcasting should attack that directly.
+
+## Methodology
+
+- **Harness:** [`ether/etherpad-load-test`](https://github.com/ether/etherpad-load-test) at the post-#100 main (sim/ library + `--sweep` mode + `/stats/prometheus` scraping + `apply_mean_ms` / `emits_new_changes` CSV columns).
+- **Server-side instruments:** the three Prometheus counters added in #7762, enabled via `settings.scalingDiveMetrics=true`.
+- **SUT:** etherpad core `develop` HEAD at the time of run.
+- **Runner shape:** GitHub-hosted `ubuntu-latest` (4 vCPU, ~16 GB RAM). Same shape across all three matrix entries, so noise is constant.
+- **Workflow:** [`.github/workflows/scaling-dive.yml`](https://github.com/ether/etherpad-load-test/blob/main/.github/workflows/scaling-dive.yml), manual `workflow_dispatch`. Two runs analysed:
+  - **Run 25936626554** — default sweep `authors=10..80:step=10:dwell=15s:warmup=3s`.
+  - **Run 25936813657** — deeper sweep `authors=20..200:step=20:dwell=10s:warmup=2s` (used for the conclusions below).
+
+### Decision rules (per spec section 6)
+
+- p95 latency up *without* event-loop p99 up ⇒ network IO bound.
+- p95 latency up *with* event-loop p99 up ⇒ server CPU / event-loop bound.
+- p95 latency up *with* RSS climbing across steps ⇒ leak / backpressure.
+
+## Baseline curve
+
+The deep sweep on baseline (no levers, develop HEAD):
+
+| Step | p50 | p95 | p99 | EL p99 | apply_mean | emits_NEW_CHANGES | cpu_user (s) |
+|---:|---:|---:|---:|---:|---:|---:|---:|
+| 20  |  9 | 11 | 12 | 11 | 4.84 ms |  1 160 |  2.4 |
+| 40  |  8 | 11 | 12 | 12 | 4.62 ms |  3 520 |  4.0 |
+| 60  |  8 | 11 | 13 | 12 | 4.63 ms |  7 040 |  6.3 |
+| 80  | 10 | 17 | 19 | 12 | 5.18 ms | 11 780 |  9.5 |
+| 100 |  8 | 16 | 18 | 11 | 5.08 ms | 17 668 | 13.0 |
+| 120 |  5 | 12 | 16 | 11 | 4.55 ms | 24 793 | 17.5 |
+| 140 |  3 |  8 | 11 | 11 | 3.96 ms | 33 088 | 22.8 |
+| 160 |  4 |  9 | 11 | 11 | 3.62 ms | 42 563 | 29.0 |
+| 180 |  5 | 16 | 20 | 12 | 3.56 ms | 54 112 | 36.5 |
+| 200 |  7 | 20 | 25 | 12 | 3.67 ms | 66 032 | 44.0 |
+
+Reading against the decision rules:
+
+- p95 grows slowly (11 → 20 ms across the range), but doesn't cliff.
+- Event-loop p99 stays at 11–12 ms — flat. **Not event-loop bound.**
+- RSS climbs from 393 MB → 651 MB but no leak shape (it plateaus around step 100).
+- CPU is the headline: 200 authors burns 44 CPU-seconds in 10 s wall-clock — ~4.4 cores. The runner has 4 vCPU. We're saturating the CPU on fan-out work.
+
+So per the decision rules: **network/CPU bound, but the work is fan-out, not apply.** The `apply_mean` stays low while emits grow O(N²) with concurrency.
+
+## Lever 1 — perMessageDeflate
+
+**Not run.** Verifying that core's socket.io setup plumbs `perMessageDeflate` through settings is itself a small core PR. Folded into the recommendation below.
+
+## Lever 2 — `--max-old-space-size=4096` (NODE_OPTIONS)
+
+Run as the `nodemem` matrix entry. Selected step-by-step diff vs baseline:
+
+| Step | baseline p95 | nodemem p95 | Δ |
+|---:|---:|---:|---:|
+| 80 | 17 | 17 |  0 |
+| 120 | 12 | 16 | +4 |
+| 160 |  9 | 13 | +4 |
+| 200 | 20 | 13 | -7 |
+
+Noise within ±5 ms. RSS grows similarly. apply_mean and emits_NEW_CHANGES are essentially identical.
+
+**Verdict: no measurable effect.** The user's hunch on the issue (memory is not the bottleneck) is confirmed. Don't recommend bumping the heap as a scaling lever.
+
+## Lever 3 — fan-out batching
+
+**Deferred.** Requires a code change in `PadMessageHandler.ts` (specifically the per-socket loop in `updatePadClients` and/or the broadcast emit at line 627). Recommended as the next concrete code change. The harness is ready to score it as soon as a candidate branch exists — point the workflow's `core_ref` input at the branch.
+
+The `emits_new_changes` column on the curve table above is the direct measurement target. At 200 authors we're producing 66 032 emits per 10 s dwell. Halving the emit rate (by coalescing two changesets per emit on a sub-50 ms window) would directly reduce CPU.
+
+## Lever 4 — `socketTransportProtocols: ["websocket"]`
+
+Run as the `websocket-only` matrix entry. Selected step-by-step diff vs baseline:
+
+| Step | baseline p95 | websocket-only p95 | Δ | baseline apply_mean | ws-only apply_mean |
+|---:|---:|---:|---:|---:|---:|
+|  20 | 11 | 10 |  -1 | 4.84 ms | 3.67 ms |
+|  60 | 11 |  9 |  -2 | 4.63 ms | 3.28 ms |
+| 100 | 16 | 13 |  -3 | 5.08 ms | 3.27 ms |
+| 140 |  8 | 24 | **+16** | 3.96 ms | 5.13 ms |
+| 180 | 16 | 35 | **+19** | 3.56 ms | 8.07 ms |
+| 200 | 20 | 37 | **+17** | 3.67 ms | 8.77 ms |
+
+Below ~100 authors, websocket-only is a modest win (-1 to -3 ms p95). Above 120 authors it goes sharply worse: p95 doubles, apply_mean doubles, evloop_p99 jumps from 12 → 17. The websocket-only path also produced a single 271 ms tail max at step 40 — likely a handshake stall, but worth confirming with more runs.
+
+**Verdict: do not recommend dropping the polling fallback.** The cost of forcing all clients onto websocket compounds with concurrency. This was a legitimate hypothesis from issue #7756 (thread #1) that the dive *refutes*.
+
+## Lever 5 — raw `ws` (drop socket.io entirely)
+
+**Not pursued.** Lever 4 demonstrated that the transport choice within socket.io is already an inversion — dropping the polling fallback hurts. Ripping socket.io out entirely is high blast radius and the dive gives no signal that it would help. Defer indefinitely.
+
+## Recommendation
+
+In priority order:
+
+1. **Prototype fan-out batching** (lever 3). The dive identifies fan-out as the single dominant cost. Coalescing changesets within a sub-50 ms window inside `updatePadClients` is the highest-leverage code change. Open a feature branch in core; the harness scores it via `workflow_dispatch` with `core_ref` pointing at the branch.
+2. **Verify and run lever 1** (`perMessageDeflate`). Even if compression has overhead at low concurrency, at 200 authors the emit *bytes* are the second-order cost behind emit *count*. Worth scoring once lever 3 is in.
+3. **Do not merge lever 4.** Keep `socketTransportProtocols: ["websocket", "polling"]` as the default.
+4. **Do not merge lever 2.** No effect.
+5. **Add core counters for fan-out byte size** as a small follow-up to #7762. The histogram of changeset bytes per emit would make lever 1 scorable without instrumenting client-side.
+
+## Reproducing
+
+```
+# Trigger a dive run against any core ref.
+gh workflow run "Scaling dive" --repo ether/etherpad-load-test \
+  -f core_ref=develop \
+  -f sweep='authors=20..200:step=20:dwell=10s:warmup=2s'
+
+# Fetch artifacts.
+gh run download <RUN_ID> --repo ether/etherpad-load-test
+```
+
+Per-lever CSV / JSON / MD artifacts drop in `scaling-dive-{baseline,websocket-only,nodemem}/`. The CSV is plot-ready; the JSON has the full per-step `Snapshot.gauges`.
+
+## Out of scope (sequel issues worth filing)
+
+- The `apply_mean` calculation uses `histogram._sum / histogram._count` for a simple mean. A proper p99 from the bucket distribution would require parsing `_bucket{le=...}` rows in the harness. Worth adding to the Scraper if lever 3 scoring needs it.
+- The websocket-only step-40 spike (271 ms max) needs a second run to confirm it isn't a flake.
+- The harness sweep stops short of producing a *cliff* — even 200 authors didn't trip the breakage thresholds. A "big cluster" dive (multi-host harness) is the natural sequel but is explicitly out of scope per spec section 9.
+- Re-run with the same methodology after every batching-prototype iteration to track progress numerically.

From 80b3b740d11ff03f93de586dfd8e2ad4791465bd Mon Sep 17 00:00:00 2001
From: John McLear <john@mclear.co.uk>
Date: Sat, 16 May 2026 06:23:24 +0100
Subject: [PATCH 02/15] docs(scaling-dive): rewrite with cliff-finding + Qodo
 fixes
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Captures everything learned since the first draft:

- The "250-author cliff" was a measurement artefact from per-IP
  commitRateLimiting + colocated harness. Fixed via the
  etherpad-load-test#105 workflow patch. Real ceiling is ~350-400
  authors on a 4-vCPU GitHub runner.

- apply_mean ballooning at the cliff isn't slow code — it's OS
  preemption (7+ cores of work on 4 vCPU). Application-level JS
  rearrangement can't reach it.

- Two changes hold up under the dive: fan-out serialization
  + NEW_CHANGES_BATCH (#7768, 73% p95 drop at 200 authors) and
  historicalAuthorData cache (#7769, neutral on dive but real
  production thundering-herd fix at join time).

- Four directions didn't pan out: WebSocket-only transport, heap
  bump, message-level batching alone (#7766 closed), and
  rebase-loop prefetch (#7770 closed). Each has a one-line cause
  documented for the record.

- Engine.io transport-level packing (#7767) is the meatiest
  untouched lever — sending multiple packets per WebSocket frame
  the way polling already does via encodePayload.

Qodo-flagged corrections incorporated:
1. The new instruments are Histogram + Counter + Gauge, not
   "three counters" — labelled correctly.
2. The lever-3 line reference now points at updatePadClients
   (lines 985-999) where NEW_CHANGES actually emits, not the
   wrong line 627 (handleSaveRevisionMessage).
3. Lever 3's results are written up against measured data, not
   "deferred".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/scaling-dive-2026-05.md | 212 +++++++++++++++++++++++------------
 1 file changed, 140 insertions(+), 72 deletions(-)

diff --git a/docs/scaling-dive-2026-05.md b/docs/scaling-dive-2026-05.md
index ce2468ade4d..518cfe4a8c4 100644
--- a/docs/scaling-dive-2026-05.md
+++ b/docs/scaling-dive-2026-05.md
@@ -1,133 +1,201 @@
 # Scaling dive — 2026-05
 
-**Closes Phase 2 of #7756.** First numbers-backed answer to "how many editors can be on one pad, and what is the bottleneck when it falls over?"
+**Closes Phase 2 of #7756.** Numbers-backed answer to "how many editors can be on one pad, and what is the bottleneck when it falls over?"
+
+Every claim links to a CI run whose `report.json` is downloadable for re-analysis.
 
 ## TL;DR
 
-Two clean conclusions from three matrix runs on the same GitHub-hosted `ubuntu-latest` runner shape:
+1. **The "250-author cliff" we kept hitting was a measurement artefact**, not a real ceiling. `NODE_ENV=production` enables Etherpad's per-IP `commitRateLimiting`. With the harness colocated on the SUT runner, all simulated authors share `127.0.0.1` = one bucket. At 200 authors × 5 edits/sec the bucket sits exactly at the default ceiling (`points: 1000`). New joiners' `CLIENT_READY` consumes a point and gets `disconnect: rateLimited`. Fixed in [etherpad-load-test#105](https://github.com/ether/etherpad-load-test/pull/105) by raising `points` to 1 000 000 in the dive workflow's `settings.json` setup. Production deployments with many client IPs are not affected.
+
+2. **The real ceiling on a GitHub-hosted `ubuntu-latest` runner (4 vCPU) is ~350–400 concurrent authors per pad**, with `p95 ≈ 2000 ms` and the process consuming 7+ CPU-seconds per wall-second (over-saturated). See run [25949421120](https://github.com/ether/etherpad-load-test/actions/runs/25949421120).
 
-1. **Server-side changeset apply is not the bottleneck.** Even at 200 concurrent authors, `etherpad_changeset_apply_duration_seconds` mean is ~3.7–4.4 ms — well under client-perceived p95 (~20–25 ms). The remaining latency lives in *fan-out*, not in *apply*.
-2. **Dropping the socket.io polling fallback (`socketTransportProtocols: ["websocket"]`) makes things worse, not better, under high concurrency.** At 200 authors it nearly doubles client p95 (37 ms vs 20 ms baseline). The hypothesis that the polling fallback was costing us is falsified.
+3. **Server-side changeset apply is not the bottleneck.** The mean derived from `etherpad_changeset_apply_duration_seconds_{sum,count}` stays under 14 ms through 350 authors. The jump in apply_mean to roughly 40 ms and beyond at the cliff is **OS preemption** (4 vCPU can't run 7 cores of work simultaneously), not slow code paths.
 
-Raising the node heap (`--max-old-space-size=4096`) makes no measurable difference — memory is not where the cost lives.
+4. **Two changes hold up under the dive and are merge-worthy:**
+   - **Per-socket fan-out serialization** ([#7768](https://github.com/ether/etherpad/pull/7768)): claims the `(startRev, headRev]` range immediately so a second concurrent `updatePadClients` for the same socket sees the bumped rev and skips. 73% p95 drop at step 200 in [run 25941483750](https://github.com/ether/etherpad-load-test/actions/runs/25941483750). The gain comes *not* from the NEW_CHANGES_BATCH framing (which never fired in steady state) but from preventing CPU contention between overlapping fan-outs.
+   - **Per-pad `historicalAuthorData` cache** ([#7769](https://github.com/ether/etherpad/pull/7769)): collapses simultaneous joiners' Promise.all-over-all-authors into one shared computation. Doesn't move the dive cliff (steady-state CPU is the wall) but fixes a real production thundering-herd at join time.
 
-Next step: prototype the **fan-out batching** lever (spec section 9 lever 3). Today `etherpad_socket_emits_total{type=NEW_CHANGES}` scales O(N²) — 1160 emits per 10s dwell at 20 authors grows to 66 032 emits at 200 authors. Coalescing N changesets within a configurable window before broadcasting should attack that directly.
+5. **Four directions did not pan out** and are documented for the record:
+   - WebSocket-only transport (`socketTransportProtocols: ["websocket"]`): consistently **worse** at high concurrency. Cause traced to engine.io's WebSocket transport sending one frame per packet vs polling's payload-batched HTTP responses. See [#7767](https://github.com/ether/etherpad/issues/7767).
+   - `--max-old-space-size=4096` (NODE_OPTIONS): no measurable effect.
+   - Message-level batching alone (debounced fan-out, [first #7766 attempt, closed](https://github.com/ether/etherpad/pull/7766)): didn't reduce emit volume — the per-socket loop still fires one emit per rev regardless of how many revs are pending in one call.
+   - Rebase-loop `Promise.all` prefetch ([#7770, closed](https://github.com/ether/etherpad/pull/7770)): cached `pad.getRevision` resolves via **microtask** continuation, not macrotask. Microtasks drain freely under CPU pressure so collapsing N→1 yields buys nothing.
+
+The next concrete direction with leverage is **engine.io transport-level packing** — sending multiple engine.io packets in one WebSocket frame instead of one frame per packet. See "Where to take this next" below.
 
 ## Methodology
 
-- **Harness:** [`ether/etherpad-load-test`](https://github.com/ether/etherpad-load-test) at the post-#100 main (sim/ library + `--sweep` mode + `/stats/prometheus` scraping + `apply_mean_ms` / `emits_new_changes` CSV columns).
-- **Server-side instruments:** the three Prometheus counters added in #7762, enabled via `settings.scalingDiveMetrics=true`.
-- **SUT:** etherpad core `develop` HEAD at the time of run.
-- **Runner shape:** GitHub-hosted `ubuntu-latest` (4 vCPU, ~16 GB RAM). Same shape across all three matrix entries, so noise is constant.
-- **Workflow:** [`.github/workflows/scaling-dive.yml`](https://github.com/ether/etherpad-load-test/blob/main/.github/workflows/scaling-dive.yml), manual `workflow_dispatch`. Two runs analysed:
-  - **Run 25936626554** — default sweep `authors=10..80:step=10:dwell=15s:warmup=3s`.
-  - **Run 25936813657** — deeper sweep `authors=20..200:step=20:dwell=10s:warmup=2s` (used for the conclusions below).
+- **Harness:** [`ether/etherpad-load-test`](https://github.com/ether/etherpad-load-test) at `main`. `--sweep` mode emits client-side latency histograms (HdrHistogram) and scrapes `/stats/prometheus` once per step. Reports as `report.json`/`csv`/`md`.
+- **Server-side instruments** added by [#7762](https://github.com/ether/etherpad/pull/7762), gated by `settings.scalingDiveMetrics`:
+  - **Histogram** `etherpad_changeset_apply_duration_seconds` — wall-clock around the apply path inside `handleUserChanges`, *excluding* fan-out. Exposes `_bucket{le=...}`, `_sum`, `_count`.
+  - **Counter** `etherpad_socket_emits_total{type}` — bumped at every fan-out emit site. `type` is bounded to a known allowlist; unknown values fold into `"other"`.
+  - **Gauge** `etherpad_pad_users{padId}` — populated per scrape from `sessioninfos`.
+- **SUT:** etherpad core at the ref under test. Default `develop` HEAD; PRs scored by setting `core_ref=<branch>`.
+- **Runner shape:** GitHub-hosted `ubuntu-latest` (4 vCPU, ~16 GB RAM). Same shape across all matrix entries in a single run, so noise is constant for that run. Different runs use different physical runners, so cross-run absolute numbers are not comparable; **within a single run, lever-vs-baseline differences are reliable.**
+- **Workflow:** [`.github/workflows/scaling-dive.yml`](https://github.com/ether/etherpad-load-test/blob/main/.github/workflows/scaling-dive.yml), manual `workflow_dispatch`. Inputs: `core_ref`, `sweep`. The workflow patches `loadTest: true`, `commitRateLimiting.points: 1000000` (so colocation doesn't trip the rate limiter), and `scalingDiveMetrics: true` into the SUT's `settings.json` before launch.
+- **Breakage thresholds** (in the harness): `p95 > 2000ms`, `eventloop_p95 > 500ms`, `errorRate > 5%`. The harness records a `break` flag in the CSV when any of them fires. `--break-action stop` would exit early, but the dive uses the default `continue` so that the curve past the breakage stays visible.
 
-### Decision rules (per spec section 6)
+### Decision rules
 
 - p95 latency up *without* event-loop p99 up ⇒ network IO bound.
 - p95 latency up *with* event-loop p99 up ⇒ server CPU / event-loop bound.
 - p95 latency up *with* RSS climbing across steps ⇒ leak / backpressure.
+- All four levers cliffing at the same step ⇒ the bottleneck is shared infrastructure (CPU saturation, OS scheduling), not anything any single lever can move.
 
 ## Baseline curve
 
-The deep sweep on baseline (no levers, develop HEAD):
-
-| Step | p50 | p95 | p99 | EL p99 | apply_mean | emits_NEW_CHANGES | cpu_user (s) |
-|---:|---:|---:|---:|---:|---:|---:|---:|
-| 20  |  9 | 11 | 12 | 11 | 4.84 ms |  1 160 |  2.4 |
-| 40  |  8 | 11 | 12 | 12 | 4.62 ms |  3 520 |  4.0 |
-| 60  |  8 | 11 | 13 | 12 | 4.63 ms |  7 040 |  6.3 |
-| 80  | 10 | 17 | 19 | 12 | 5.18 ms | 11 780 |  9.5 |
-| 100 |  8 | 16 | 18 | 11 | 5.08 ms | 17 668 | 13.0 |
-| 120 |  5 | 12 | 16 | 11 | 4.55 ms | 24 793 | 17.5 |
-| 140 |  3 |  8 | 11 | 11 | 3.96 ms | 33 088 | 22.8 |
-| 160 |  4 |  9 | 11 | 11 | 3.62 ms | 42 563 | 29.0 |
-| 180 |  5 | 16 | 20 | 12 | 3.56 ms | 54 112 | 36.5 |
-| 200 |  7 | 20 | 25 | 12 | 3.67 ms | 66 032 | 44.0 |
+Run [25949525421](https://github.com/ether/etherpad-load-test/actions/runs/25949525421), `core_ref=develop`, sweep `authors=100..500:step=50:dwell=8s:warmup=2s` with the rate-limit fix applied:
+
+| Step | p50 | p95 | p99 | EL p99 | apply_mean | emits | cpu_user | RSS (MB) |
+|---:|---:|---:|---:|---:|---:|---:|---:|---:|
+| 100 |  29 |  38 |  43 | 13 | 13.7 ms |  4 600 |  4.7 |  481 |
+| 150 |  19 |  32 |  39 | 14 | 11.1 ms | 11 822 |  8.7 |  591 |
+| 200 |  14 |  30 |  35 | 14 |  9.9 ms | 22 452 | 14.7 |  637 |
+| 250 |  12 |  26 |  30 | 13 |  9.0 ms | 34 752 | 21.0 |  755 |
+| 300 |  23 |  40 |  48 | 17 |  9.7 ms | 50 900 | 29.2 |  787 |
+| 350 |  56 |  84 | 101 | 18 | 13.8 ms | 68 046 | 38.7 |  883 |
+| **400** | **1345** | **2015** | **2071** | **48** | **39.1 ms** | **89 277** | **54.2** | **1002** |
+| 450 | 4447 | 5651 | 5771 | 46 | 60.0 ms | 109 458 | 70.2 | 1022 |
+| 500 | 9015 | 10823 | 10999 | 59 | 78.7 ms | 128 362 | 86.3 | 1064 |
 
 Reading against the decision rules:
 
-- p95 grows slowly (11 → 20 ms across the range), but doesn't cliff.
-- Event-loop p99 stays at 11–12 ms — flat. **Not event-loop bound.**
-- RSS climbs from 393 MB → 651 MB but no leak shape (it plateaus around step 100).
-- CPU is the headline: 200 authors burns 44 CPU-seconds in 10 s wall-clock — ~4.4 cores. The runner has 4 vCPU. We're saturating the CPU on fan-out work.
+- p95 grows mildly (38 → 84 ms) through step 350, then cliffs.
+- Event-loop p99 stays at 13–18 ms through step 350. At the cliff it jumps to 48 ms — JS-runtime scheduling pressure, not single long-running syncs.
+- RSS climbs steadily (481 → 1064 MB), roughly linearly in author count (~1.5 MB per additional author). No leak shape.
+- **CPU is the wall.** At step 400 the process accumulated 54.2 CPU-seconds in 8 wall-seconds = ~6.8 cores of work, on a 4-vCPU runner. The kernel time-slices node out; `apply_mean` measures wall-clock around `handleUserChanges`, which counts time parked in the runqueue. By step 500 we're consuming ~10.8 cores of work.
+- `emits_NEW_CHANGES` scales roughly O(N²): 4 600 emits at 100 authors → 128 362 at 500 authors (5× the authors, ~28× the emits). Fan-out is the dominant per-csps work, which makes it the obvious lever even though the cliff at 400 also has an OS-scheduling component.
 
-So per the decision rules: **network/CPU bound, but the work is fan-out, not apply.** The `apply_mean` stays low while emits grow O(N²) with concurrency.
+## Lever scoring
 
-## Lever 1 — perMessageDeflate
+### Lever 0 — baseline
 
-**Not run.** Verifying that core's socket.io setup plumbs `perMessageDeflate` through settings is itself a small core PR. Folded into the recommendation below.
+Covered above. Cliffs at step 400 on a 4-vCPU runner.
 
-## Lever 2 — `--max-old-space-size=4096` (NODE_OPTIONS)
+### Lever 1 — `perMessageDeflate`
 
-Run as the `nodemem` matrix entry. Selected step-by-step diff vs baseline:
+**Not run.** Core's socket.io setup doesn't currently expose `perMessageDeflate` through `settings.socketIo`; adding it is a small core PR sequenced after we have a candidate that benefits from compressed wire bytes. Once fan-out frame count drops (transport-level packing, below), the bytes-per-frame become the next-order cost and this lever becomes worth measuring.
+
+### Lever 2 — `--max-old-space-size=4096` (NODE_OPTIONS)
+
+Run as the `nodemem` matrix entry. Selected diffs vs baseline at the same step within run [25949421120](https://github.com/ether/etherpad-load-test/actions/runs/25949421120):
 
 | Step | baseline p95 | nodemem p95 | Δ |
 |---:|---:|---:|---:|
-| 80 | 17 | 17 |  0 |
-| 120 | 12 | 16 | +4 |
-| 160 |  9 | 13 | +4 |
-| 200 | 20 | 13 | -7 |
+| 100 | 34 | 26 |  -8 |
+| 200 | 18 | 26 |  +8 |
+| 300 | 63 | 64 |  +1 |
+
+Within noise. RSS comparable. No effect.
+
+**Verdict: do not recommend.** Memory isn't where the cost lives.
 
-Noise within ±5 ms. RSS grows similarly. apply_mean and emits_NEW_CHANGES are essentially identical.
+### Lever 3 — fan-out batching (per-socket serialization + NEW_CHANGES_BATCH) — **open as [#7768](https://github.com/ether/etherpad/pull/7768)**
 
-**Verdict: no measurable effect.** The user's hunch on the issue (memory is not the bottleneck) is confirmed. Don't recommend bumping the heap as a scaling lever.
+The dive identified fan-out emits scaling O(N²) as the dominant per-csps work. This PR delivers two changes bundled together:
 
-## Lever 3 — fan-out batching
+**Change A — per-socket fan-out serialization.** `updatePadClients` is called once per accepted USER_CHANGES, asynchronously. The original implementation advanced `sessioninfo.rev` inside the collect phase, *before* the emit, allowing two `updatePadClients` runs for the same socket to overlap and contend for CPU. The fix snapshots `startRev` and `headRev` once at the top of the per-socket block and writes `sessioninfo.rev = headRev` immediately. A concurrent second run sees the bumped rev and skips the range; if the emit throws, `sessioninfo.rev` rolls back to `startRev`. **One fan-out per socket per pad at a time.** Change lives inside `exports.updatePadClients`, around lines 985–999 of `src/node/handler/PadMessageHandler.ts`.
 
-**Deferred.** Requires a code change in `PadMessageHandler.ts` (specifically the per-socket loop in `updatePadClients` and/or the broadcast emit at line 627). Recommended as the next concrete code change. The harness is ready to score it as soon as a candidate branch exists — point the workflow's `core_ref` input at the branch.
+**Change B — NEW_CHANGES_BATCH wire format.** When a recipient is more than one rev behind, the server packs queued revs into one `NEW_CHANGES_BATCH` emit. Same information as N back-to-back `NEW_CHANGES` messages, consolidated into one engine.io packet. Single-rev fan-outs (the steady-state common case) stay as plain `NEW_CHANGES` — no framing overhead for normal load. Feature-flagged behind `settings.newChangesBatch: false` default; clients are forward-compatible.
 
-The `emits_new_changes` column on the curve table above is the direct measurement target. At 200 authors we're producing 66 032 emits per 10 s dwell. Halving the emit rate (by coalescing two changesets per emit on a sub-50 ms window) would directly reduce CPU.
+**Scored on run [25941483750](https://github.com/ether/etherpad-load-test/actions/runs/25941483750):**
 
-## Lever 4 — `socketTransportProtocols: ["websocket"]`
+| | baseline | this PR | Δ |
+|---|---:|---:|---:|
+| p50 latency at 200 | 50 ms | 15 ms | -70% |
+| p95 latency at 200 | 89 ms | 24 ms | -73% |
+| p99 latency at 200 | 144 ms | 32 ms | -78% |
+| server apply_mean at 200 | 10.7 ms | 4.66 ms | -56% |
+| errors at 200 | 8 | 0 | clean |
 
-Run as the `websocket-only` matrix entry. Selected step-by-step diff vs baseline:
+The dive's apply-duration histogram confirms the mechanism: of 66 069 applies at step 200, **43 912 (66%)** finished under 5 ms with this PR vs **28 317 (43%)** on baseline. The synchronous apply work is constant; the previous tail came from CPU contention with overlapping fan-outs.
 
-| Step | baseline p95 | websocket-only p95 | Δ | baseline apply_mean | ws-only apply_mean |
+**Important caveat:** `etherpad_socket_emits_total{type=NEW_CHANGES_BATCH}` stayed at 0 in this run because the steady-state catch-up is 1 rev at a time per recipient. So the *win above is from change A* (serialization), not change B (batching). The batching codepath fires under server slowness (GC pauses, disk hiccups, sustained delays inside `updatePadClients`) — and the serialization in change A guarantees we'll coalesce when there's something to coalesce.
+
+**Verdict: recommend merging.** Both changes are correctness-preserving (the rev-claim-rollback keeps the original retry semantics; batching is flag-gated). Change A is a real correctness improvement on top of being a perf win — the previous implementation was racy under concurrent commits.
+
+### Lever 4 — `socketTransportProtocols: ["websocket"]` (drop polling fallback)
+
+Run as the `websocket-only` matrix entry. Selected diffs vs baseline in run [25940112728](https://github.com/ether/etherpad-load-test/actions/runs/25940112728):
+
+| Step | baseline p95 | ws-only p95 | Δ | baseline apply_mean | ws-only apply_mean |
 |---:|---:|---:|---:|---:|---:|
-|  20 | 11 | 10 |  -1 | 4.84 ms | 3.67 ms |
-|  60 | 11 |  9 |  -2 | 4.63 ms | 3.28 ms |
-| 100 | 16 | 13 |  -3 | 5.08 ms | 3.27 ms |
-| 140 |  8 | 24 | **+16** | 3.96 ms | 5.13 ms |
-| 180 | 16 | 35 | **+19** | 3.56 ms | 8.07 ms |
-| 200 | 20 | 37 | **+17** | 3.67 ms | 8.77 ms |
+| 100 | 11 | 18 |  +7 | 4.2 ms |  5.1 ms |
+| 140 |  8 | 24 | +16 | 4.0 ms |  5.1 ms |
+| 180 | 16 | 35 | +19 | 3.6 ms |  8.1 ms |
+| **200** | **22** | **82** | **+60** | **5.0 ms** | **13.3 ms** |
+
+Below ~100 authors, WS-only is a small win. Above 120, it's sharply worse: at 200 authors p95 nearly quadruples (22 → 82 ms) and apply_mean nearly triples (5.0 → 13.3 ms).
+
+**Mechanism** (investigated in [#7767](https://github.com/ether/etherpad/issues/7767)): engine.io's WebSocket transport sends **one WS frame per engine.io packet**, while the polling transport encodes the full queued payload into one HTTP response. At high emit rate the WS path is dominated by per-frame system calls; the polling fallback acts as a natural coalescer at the HTTP boundary. Forcing pure-WS removes that coalescing without replacing it.
+
+**Verdict: do not recommend.** Keep `socketTransportProtocols: ["websocket", "polling"]` as the default. The natural-coalescer property of polling is doing real work; the longer-term fix is transport-level packing on WebSocket, not removing polling.
 
-Below ~100 authors, websocket-only is a modest win (-1 to -3 ms p95). Above 120 authors it goes sharply worse: p95 doubles, apply_mean doubles, evloop_p99 jumps from 12 → 17. The websocket-only path also produced a single 271 ms tail max at step 40 — likely a handshake stall, but worth confirming with more runs.
+### Lever 5 — raw `ws` (drop socket.io entirely)
 
-**Verdict: do not recommend dropping the polling fallback.** The cost of forcing all clients onto websocket compounds with concurrency. This was a legitimate hypothesis from issue #7756 (thread #1) that the dive *refutes*.
+**Not pursued.** Lever 4 already shows that the choice *within* socket.io is non-trivial. Ripping socket.io out is high blast radius and the dive shows no signal it would help. Deferred indefinitely.
 
-## Lever 5 — raw `ws` (drop socket.io entirely)
+### Lever 6 — `historicalAuthorData` cache (join hot path) — **open as [#7769](https://github.com/ether/etherpad/pull/7769)**
 
-**Not pursued.** Lever 4 demonstrated that the transport choice within socket.io is already an inversion — dropping the polling fallback hurts. Ripping socket.io out entirely is high blast radius and the dive gives no signal that it would help. Defer indefinitely.
+The pre-PR `handleClientReady` did `Promise.all(pad.getAllAuthors().map(authorManager.getAuthor))` on every CLIENT_READY. At 200 existing authors × 50 simultaneous joiners that's **10 000 ueberdb cache lookups + Promise.all bookkeeping** racing against existing authors' USER_CHANGES for the event loop.
+
+This PR caches the `{authorId → {name, colorId}}` map per pad with a 5-second TTL. 50 joiners share **one** computation. Defensive shallow-clone on every `get()` so callers may freely mutate. In-flight-promise guard prevents a slow compute + TTL expiry from spawning a duplicate. Missing-author log preserved.
+
+**It does not move the dive cliff** — at 350-400 authors the bottleneck is steady-state CPU saturation, not join-path cost. **It does** fix a real production thundering-herd condition (many users joining the same pad in a short window). Steady-state numbers up to step 350 are unchanged in [run 25949421120](https://github.com/ether/etherpad-load-test/actions/runs/25949421120) vs develop in [run 25949525421](https://github.com/ether/etherpad-load-test/actions/runs/25949525421).
+
+**Verdict: recommend merging** for the production correctness benefit. Not a cliff-mover.
+
+### Lever 7 — rebase-loop prefetch (closed [#7770](https://github.com/ether/etherpad/pull/7770))
+
+Hypothesis was that the per-rev `await pad.getRevision(r)` in the rebase loop yielded the event loop, queuing continuations behind macrotasks under load. Prefetching the range in one `Promise.all` would collapse N yields to 1.
+
+**Did not help.** Scored against the dive: apply_mean and p95 unchanged within noise at every step in run [25953329610](https://github.com/ether/etherpad-load-test/actions/runs/25953329610). Mechanism: cached `pad.getRevision` resolves via **microtask** continuation, which drains after the current task before any macrotask, so it doesn't queue behind unrelated work under CPU pressure. The model was wrong.
+
+The PR's snapshot-headRev correctness benefit (less race in the existing `assert([r, r + 1].includes(newRev))` under concurrent writers) is real but minor — not worth landing on its own.
 
 ## Recommendation
 
-In priority order:
+**Merge in priority order:**
+
+1. **[#7768](https://github.com/ether/etherpad/pull/7768)** — per-socket fan-out serialization + NEW_CHANGES_BATCH. The real, measured win. Correctness-positive.
+2. **[#7762](https://github.com/ether/etherpad/pull/7762)** — Prometheus metrics. Already merged; instrument for any further dive.
+3. **[#7769](https://github.com/ether/etherpad/pull/7769)** — `historicalAuthorData` cache. Production thundering-herd fix, neutral on dive.
+
+**Do not merge:**
+
+- WebSocket-only transport (lever 4).
+- `--max-old-space-size` heap bump (lever 2).
+- The closed `fanoutDebounceMs` ([#7766](https://github.com/ether/etherpad/pull/7766)) — superseded by lever 3.
+- The closed rebase-loop prefetch ([#7770](https://github.com/ether/etherpad/pull/7770)) — didn't help.
+
+## Where to take this next
+
+The dive's cliff at 350-400 authors is **steady-state CPU saturation on a 4-vCPU runner with O(N²) fan-out**. With lever 3 merged, the per-emit work is as cheap as application-level changes can make it. Further ceiling extension needs to attack one of two surfaces:
+
+1. **Transport-level packing.** From the [#7767](https://github.com/ether/etherpad/issues/7767) investigation: engine.io's WebSocket transport emits one WS frame per packet even when the engine.io socket has multiple packets queued. The polling transport already batches at the HTTP-response boundary via `encodePayload`. Packing multiple packets into one WebSocket message via the same payload encoding would reduce the WS frame rate (and thus syscall and parser cost on both sides) proportionally. This is an engine.io protocol bump — needs both server and client to recognise packed payloads — and is the meatiest untouched lever.
+
+2. **Bigger hardware or per-pad sharding.** A 4-vCPU runner is the constraint, not Etherpad. Production deployments on 8+ vCPU machines would see the cliff move proportionally with no code changes. Per-pad multi-worker sharding (different process per pad/shard) is orthogonal and lets a single host scale beyond single-core limits, but is a much larger architectural change.
 
-1. **Prototype fan-out batching** (lever 3). The dive identifies fan-out as the single dominant cost. Coalescing changesets within a sub-50 ms window inside `updatePadClients` is the highest-leverage code change. Open a feature branch in core; the harness scores it via `workflow_dispatch` with `core_ref` pointing at the branch.
-2. **Verify and run lever 1** (`perMessageDeflate`). Even if compression has overhead at low concurrency, at 200 authors the emit *bytes* are the second-order cost behind emit *count*. Worth scoring once lever 3 is in.
-3. **Do not merge lever 4.** Keep `socketTransportProtocols: ["websocket", "polling"]` as the default.
-4. **Do not merge lever 2.** No effect.
-5. **Add core counters for fan-out byte size** as a small follow-up to #7762. The histogram of changeset bytes per emit would make lever 1 scorable without instrumenting client-side.
+Direction (1) is the next concrete investigation. The dive workflow is ready to score any candidate: open a feature branch with the engine.io changes, run `gh workflow run "Scaling dive" --ref main -f core_ref=<branch>`, compare against the develop baseline numbers in this doc.
 
 ## Reproducing
 
 ```
 # Trigger a dive run against any core ref.
-gh workflow run "Scaling dive" --repo ether/etherpad-load-test \
+gh workflow run "Scaling dive" --repo ether/etherpad-load-test --ref main \
   -f core_ref=develop \
-  -f sweep='authors=20..200:step=20:dwell=10s:warmup=2s'
+  -f sweep='authors=100..500:step=50:dwell=8s:warmup=2s'
 
 # Fetch artifacts.
 gh run download <RUN_ID> --repo ether/etherpad-load-test
 ```
 
-Per-lever CSV / JSON / MD artifacts drop in `scaling-dive-{baseline,websocket-only,nodemem}/`. The CSV is plot-ready; the JSON has the full per-step `Snapshot.gauges`.
+Per-lever CSV / JSON / MD artifacts drop in `scaling-dive-{baseline,websocket-only,nodemem,new-changes-batch}/`. The CSV is plot-ready (column set fixed in [load-test#100](https://github.com/ether/etherpad-load-test/pull/100)); the JSON has the full per-step Prometheus snapshot.
 
 ## Out of scope (sequel issues worth filing)
 
-- The `apply_mean` calculation uses `histogram._sum / histogram._count` for a simple mean. A proper p99 from the bucket distribution would require parsing `_bucket{le=...}` rows in the harness. Worth adding to the Scraper if lever 3 scoring needs it.
-- The websocket-only step-40 spike (271 ms max) needs a second run to confirm it isn't a flake.
-- The harness sweep stops short of producing a *cliff* — even 200 authors didn't trip the breakage thresholds. A "big cluster" dive (multi-host harness) is the natural sequel but is explicitly out of scope per spec section 9.
-- Re-run with the same methodology after every batching-prototype iteration to track progress numerically.
+- A proper p99 from `etherpad_changeset_apply_duration_seconds_bucket{le=...}` would require the harness Scraper to parse histogram buckets. The dive currently shows `apply_mean` (sum/count). For lever-3 follow-up scoring this could matter.
+- The websocket-only step-40 spike in run 25934713423 (271 ms max) needs a second run to confirm it isn't a flake.
+- The dive uses `dwell=8-10s` per step. Some commits-in-flight at step boundaries may bias the sub-1s latency tail. A longer dwell (30s+) trades wall-clock for tighter measurements; not worth it until the next lever has landed.
+- Recurring measurement (nightly CI) is explicitly out of scope. Single dated dive doc, re-run on demand.

From 142c5f14258ba9414b489127c5e25ef91c29e990 Mon Sep 17 00:00:00 2001
From: John McLear <john@mclear.co.uk>
Date: Sat, 16 May 2026 07:05:39 +0100
Subject: [PATCH 03/15] docs(scaling-dive): add lever-8 negative result +
 methodology noise caveat

---
 docs/scaling-dive-2026-05.md | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/docs/scaling-dive-2026-05.md b/docs/scaling-dive-2026-05.md
index 518cfe4a8c4..a367d354ea2 100644
--- a/docs/scaling-dive-2026-05.md
+++ b/docs/scaling-dive-2026-05.md
@@ -32,7 +32,7 @@ The next concrete direction with leverage is **engine.io transport-level packing
   - **Counter** `etherpad_socket_emits_total{type}` — bumped at every fan-out emit site. `type` is bounded to a known allowlist; unknown values fold into `"other"`.
   - **Gauge** `etherpad_pad_users{padId}` — populated per scrape from `sessioninfos`.
 - **SUT:** etherpad core at the ref under test. Default `develop` HEAD; PRs scored by setting `core_ref=<branch>`.
-- **Runner shape:** github-hosted `ubuntu-latest` (4 vCPU, ~16 GB RAM). Same shape across all matrix entries in a single run, so noise is constant for that run. Different runs use different physical runners, so cross-run absolute numbers are not comparable; **within a single run, lever-vs-baseline differences are reliable.**
+- **Runner shape:** github-hosted `ubuntu-latest` (advertised 4 vCPU, ~16 GB RAM). **Caveat (discovered while scoring lever 8 — see [#7767](https://github.com/ether/etherpad/issues/7767) comment thread):** each matrix entry runs as a separate GitHub Actions job on a potentially different physical host. So "within a single dive run, lever-vs-baseline differences" is actually a cross-runner comparison. Runner noise can flip lever conclusions — one re-score showed `websocket-only` as the *best* lever when every previous dive said it was the worst. Conclusions in this doc that depend on a single dive run should be treated as suggestive, not definitive, until corroborated by N ≥ 3 trials per lever. The "Lever scoring" section below flags which conclusions are single-run vs multi-run.
 - **Workflow:** [`.github/workflows/scaling-dive.yml`](https://github.com/ether/etherpad-load-test/blob/main/.github/workflows/scaling-dive.yml), manual `workflow_dispatch`. Inputs: `core_ref`, `sweep`. The workflow patches `loadTest: true`, `commitRateLimiting.points: 1000000` (so colocation doesn't trip the rate limiter), and `scalingDiveMetrics: true` into the SUT's `settings.json` before launch.
 - **Breakage thresholds** (in the harness): `p95 > 2000ms`, `eventloop_p95 > 500ms`, `errorRate > 5%`. The harness records a `break` flag in the CSV when any fires; `--break-action stop` would early-exit, the dive uses the default `continue` so the curve past the breakage is visible.
 
@@ -154,6 +154,24 @@ Hypothesis was that the per-rev `await pad.getRevision(r)` in the rebase loop yi
 
 The PR's snapshot-headRev correctness benefit (less race in the existing `assert([r, r + 1].includes(newRev))` under concurrent writers) is real but minor — not worth landing on its own.
 
+### Lever 8 — engine.io WS transport-level packing (closed [#7772](https://github.com/ether/etherpad/pull/7772))
+
+Hypothesis from the [#7767](https://github.com/ether/etherpad/issues/7767) investigation: engine.io's WebSocket transport sends one WS frame per engine.io packet, while the polling transport coalesces via `encodePayload`. The PR monkey-patches the WS transport so multi-packet flushes go out as one payload-encoded frame.
+
+**Did not help.** Scored against [run 25954316731](https://github.com/ether/etherpad-load-test/actions/runs/25954316731): apply_mean at step 350 was 23.86 ms vs baseline 16.15 ms — neutral-to-slightly-worse. Cause: engine.io's `socket.flush()` calls `transport.send(writeBuffer)` as soon as `transport.writable === true`. For WebSocket, `writable` returns to true within microseconds of each write. So even at 10 000+ packets/sec the writeBuffer rarely accumulates more than one packet; the patch's `packets.length > 1` branch almost never triggers.
+
+The real change would be **deliberate flush deferral** — buffer multiple `sendPacket` calls within one task (via `queueMicrotask`) or within a small time window (via `setImmediate` or `setTimeout`) so the writeBuffer actually accumulates before drain. That's a bigger change to engine.io's flush semantics, ideally as an upstream PR rather than a monkey-patch. Tracked in [#7767](https://github.com/ether/etherpad/issues/7767).
+
+The harness-side forward-compat patch ([ether/etherpad-load-test#106](https://github.com/ether/etherpad-load-test/pull/106), already merged) stays — it's cheap forward-compat if a future server-side change uses payload-encoded frames intentionally.
+
+### Methodology caveat surfaced during lever 8 scoring
+
+The same run that confirmed lever 8 didn't help also showed `websocket-only` as the **best** lever — directly contradicting every prior dive in this doc. The cause is that **each matrix entry runs as a separate GitHub Actions job on a potentially different physical runner**. Within-run cross-lever comparisons are cross-hardware, and runner noise can be larger than the lever deltas we've been measuring.
+
+Strong conclusions in this doc that depend on single dive runs should be **re-validated with N ≥ 3 trials per lever**. The lever-3 (#7768) finding holds up because the histogram-bucket evidence (apply percentile distribution) is consistent across multiple measurements and the mechanism (overlapping fan-outs starving the apply path) was confirmed via histogram data, not just a single p95 row. The lever-4 (websocket-only) "always-worse" conclusion is now suspect — it might be runner-noise dominated.
+
+Filing this as a sequel investigation: **before strong-recommendation calls on any new lever, run 3× and treat per-lever p95 as a noise envelope, not a point estimate.** A new dive run [25954537767/25954538807/25954540108](https://github.com/ether/etherpad-load-test/actions) is doing exactly that against develop — three identical sweeps — to quantify the noise envelope.
+
 ## Recommendation
 
 **Merge in priority order:**

From 03f4308cd346835dd48fc11d08c41237148fe31c Mon Sep 17 00:00:00 2001
From: John McLear <john@mclear.co.uk>
Date: Sat, 16 May 2026 07:09:37 +0100
Subject: [PATCH 04/15] docs(scaling-dive): triple-run noise envelope + honest
 re-evaluation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Three identical sweeps against develop quantify the runner-noise
envelope. Same workload, same code, same workflow → p95 at step 350
ranged 39-122ms on baseline (3.1x spread). At step 300, 1.9x spread.

What this means for prior conclusions in this doc:

- websocket-only-is-worst HOLDS at the cliff: its envelope min (2463)
  equals baseline's max (2463), envelopes don't overlap. Single
  contradicting run was an outlier.

- lever-3 (#7768) "70% p95 drop at 200" was a single-run outlier
  comparison. The real reliable improvement is ~5-15% median p95
  plus much tighter consistency (fewer tail-latency excursions).
  The mechanism — per-socket serialization preventing overlapping
  fan-outs that contend for CPU — is still real and still worth
  merging; the headline number was inflated.

- below the cliff, all four levers' noise envelopes overlap. No
  clear winner.

Going forward: lever scoring should default to N >= 3 trials and
report min/median/max, not single-run point estimates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/scaling-dive-2026-05.md | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/docs/scaling-dive-2026-05.md b/docs/scaling-dive-2026-05.md
index a367d354ea2..1a0957878e5 100644
--- a/docs/scaling-dive-2026-05.md
+++ b/docs/scaling-dive-2026-05.md
@@ -166,11 +166,27 @@ The harness-side forward-compat patch ([ether/etherpad-load-test#106](https://gi
 
 ### Methodology caveat surfaced during lever 8 scoring
 
-The same run that confirmed lever 8 didn't help also showed `websocket-only` as the **best** lever — directly contradicting every prior dive in this doc. The cause is that **each matrix entry runs as a separate GitHub Actions job on a potentially different physical runner**. Within-run cross-lever comparisons are cross-hardware, and runner noise can be larger than the lever deltas we've been measuring.
+The same run that confirmed lever 8 didn't help also showed `websocket-only` as the **best** lever — directly contradicting every prior dive in this doc. The cause: **each matrix entry runs as a separate GitHub Actions job on a potentially different physical runner**. Within-run cross-lever comparisons are cross-hardware, and runner noise can be larger than the lever deltas we've been measuring.
 
-Strong conclusions in this doc that depend on single dive runs should be **re-validated with N ≥ 3 trials per lever**. The lever-3 (#7768) finding holds up because the histogram-bucket evidence (apply percentile distribution) is consistent across multiple measurements and the mechanism (overlapping fan-outs starving the apply path) was confirmed via histogram data, not just a single p95 row. The lever-4 (websocket-only) "always-worse" conclusion is now suspect — it might be runner-noise dominated.
+To quantify the noise envelope, three identical sweeps were run against `develop` ([25954537767](https://github.com/ether/etherpad-load-test/actions/runs/25954537767), [25954538807](https://github.com/ether/etherpad-load-test/actions/runs/25954538807), [25954540108](https://github.com/ether/etherpad-load-test/actions/runs/25954540108)). p95 in ms across the three runs at each step, as min / median / max:
 
-Filing this as a sequel investigation: **before strong-recommendation calls on any new lever, run 3× and treat per-lever p95 as a noise envelope, not a point estimate.** A new dive run [25954537767/25954538807/25954540108](https://github.com/ether/etherpad-load-test/actions) is doing exactly that against develop — three identical sweeps — to quantify the noise envelope.
+| Lever | step 100 (min/med/max) | step 200 | step 300 | step 350 | step 400 |
+|---|---|---|---|---|---|
+| baseline | 28 / 38 / 38 | 30 / 37 / 51 | 38 / 45 / 71 | 39 / 39 / 122 | 1758 / 2275 / 2463 |
+| websocket-only | 35 / 37 / 39 | 33 / 57 / 58 | 66 / 86 / 91 | 65 / 76 / 96 | **2463 / 2545 / 2781** |
+| nodemem | 36 / 39 / 39 | 36 / 52 / 58 | 47 / 55 / 75 | 37 / 96 / 167 | 1716 / 2037 / 2421 |
+| new-changes-batch | 31 / 34 / 36 | **32 / 35 / 38** | 27 / 68 / 80 | 32 / 95 / 607 | 2311 / 2405 / 2999 |
+
+What this triple-run shows:
+
+- **Below the cliff, noise dominates.** At step 300, the same `develop` baseline produced p95 between 38 and 71 ms across three runs — a 1.9× spread. At step 350, 3.1× spread. Single-run lever-vs-baseline differences in that range are inside the noise envelope.
+- **At the cliff (step 400), `websocket-only` is reliably the worst.** Its minimum (2463) equals baseline's maximum (2463); the envelopes don't overlap meaningfully. Confirms the original "ws-only is worse under load" conclusion. The single contradicting run was an outlier.
+- **`new-changes-batch` shows the tightest envelope at step 200.** 32/35/38 vs baseline 30/37/51. The median improvement (~2 ms) is modest, but the *consistency* improvement is real — fewer tail-latency excursions. Mechanism: the per-socket serialization in #7768 prevents the random apply-tail explosions that baseline experiences when concurrent fan-outs contend for CPU. **Earlier headline "70% p95 drop at step 200" was a single-run outlier comparison — actual reliable improvement is closer to 5-15% on median p95 with much tighter consistency.**
+- **`new-changes-batch` shows a 607 ms outlier at step 350.** Worth a second look but doesn't repeat across runs — likely a flake.
+
+The lever-3 (#7768) finding still stands but **for a different reason than originally claimed**: not a dramatic p95 reduction, but improved consistency + the correctness benefit of preventing overlapping fan-outs on the same socket. The per-socket serialization is a real correctness fix; the NEW_CHANGES_BATCH framing is currently latent (it would fire under server slowness).
+
+**Going forward, lever scoring should default to N ≥ 3 trials and report min/median/max, not single-run point estimates.**
 
 ## Recommendation
 

From 4d6eec756cb059f7d44d969baae121300d5051ef Mon Sep 17 00:00:00 2001
From: John McLear <john@mclear.co.uk>
Date: Sat, 16 May 2026 07:20:32 +0100
Subject: [PATCH 05/15] docs(scaling-dive): close #7769 in the doc + update
 recommendations
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

N=3 scoring of feat/cache-historical-author-data shows it's
net-negative above 300 authors (step 350 p95 envelope
301/488/633ms vs develop baseline 39/39/122ms). Two compounding
issues:
- The motivating hypothesis (250-cliff is a join thundering herd)
  was falsified — that cliff was the per-IP rate-limit artefact.
- The defensive shallow-clone-on-every-get() added in the Qodo
  fix walks O(N) author entries per join, costing more than the
  inline Promise.all it replaced.

Updated recommendations: lever 3 (#7768) is now the only PR worth
merging. lever 6 (#7769) added to the do-not-merge list with
honest data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/scaling-dive-2026-05.md | 40 +++++++++++++++++++++++++-----------
 1 file changed, 28 insertions(+), 12 deletions(-)

diff --git a/docs/scaling-dive-2026-05.md b/docs/scaling-dive-2026-05.md
index 1a0957878e5..4f80631e373 100644
--- a/docs/scaling-dive-2026-05.md
+++ b/docs/scaling-dive-2026-05.md
@@ -136,13 +136,26 @@ Below ~100 authors, WS-only is a small win. Above 120, it's sharply worse — p9
 
 **Not pursued.** Lever 4 already shows that the choice *within* socket.io is non-trivial. Ripping socket.io out is high blast radius and the dive shows no signal it would help. Deferred indefinitely.
 
-### Lever 6 — `historicalAuthorData` cache (join hot path) — **open as [#7769](https://github.com/ether/etherpad/pull/7769)**
+### Lever 6 — `historicalAuthorData` cache (closed [#7769](https://github.com/ether/etherpad/pull/7769))
 
-The pre-PR `handleClientReady` did `Promise.all(pad.getAllAuthors().map(authorManager.getAuthor))` on every CLIENT_READY. At 200 existing authors × 50 simultaneous joiners that's **10 000 ueberdb cache lookups + Promise.all bookkeeping** racing against existing authors' USER_CHANGES for the event loop.
+Hypothesis: `handleClientReady` does `Promise.all(pad.getAllAuthors().map(authorManager.getAuthor))` per CLIENT_READY. Caching the result per pad would collapse 50 simultaneous joiners' 10 000 lookups into one shared computation.
 
-This PR caches the `{authorId → {name, colorId}}` map per pad with a 5-second TTL. 50 joiners share **one** computation. Defensive shallow-clone on every `get()` so callers may freely mutate. In-flight-promise guard prevents a slow compute + TTL expiry from spawning a duplicate. Missing-author log preserved.
+**Closed after N=3 scoring contradicted the hypothesis.** Comparison of develop baseline vs the cache PR, p95 envelope in ms (min / median / max) across 3 runs each:
 
-**It does not move the dive cliff** — at 350-400 authors the bottleneck is steady-state CPU saturation, not join-path cost. **It does** fix a real production thundering-herd condition (many users joining the same pad in a short window). Steady-state numbers up to step 350 are unchanged in [run 25949421120](https://github.com/ether/etherpad-load-test/actions/runs/25949421120) vs develop in [run 25949525421](https://github.com/ether/etherpad-load-test/actions/runs/25949525421).
+| Step | develop | cache PR | verdict |
+|---:|---|---|---|
+| 200 | 30 / 37 / 51 | 29 / 38 / 65 | within noise |
+| 300 | 38 / 45 / 71 | 39 / 93 / 240 | cache **worse** |
+| 350 | 39 / 39 / 122 | 301 / 488 / 633 | cache **much worse** |
+| 400 | 1758 / 2275 / 2463 | 3053 / 3203 / 3327 | cache worse at cliff |
+
+Two compounding problems:
+
+1. **The motivating hypothesis was wrong.** The 250-author cliff that prompted this PR was the per-IP `commitRateLimiting` artefact from harness colocation (fixed in [load-test#105](https://github.com/ether/etherpad-load-test/pull/105)), not a join-path thundering herd. There was no join-path bottleneck to fix.
+
+2. **The implementation was net-negative.** The defensive shallow-clone-on-every-get() added in the Qodo-feedback fix walks O(N) author entries per call. With burst-of-50 new joiners × N existing authors × clone allocations at each step ramp + GC pressure, the cache costs more than the inline Promise.all it replaced.
+
+The HistoricalAuthorDataCache module is a useful template; if anyone revisits, drop the defensive clone (replace with a "don't mutate" contract) and the result might net out positive in actual production thundering-herd scenarios that the dive doesn't measure.
 
 **Verdict: recommend merging** for the production correctness benefit. Not a cliff-mover.
 
@@ -192,26 +205,29 @@ The lever-3 (#7768) finding still stands but **for a different reason than origi
 
 **Merge in priority order:**
 
-1. **[#7768](https://github.com/ether/etherpad/pull/7768)** — per-socket fan-out serialization + NEW_CHANGES_BATCH. The real, measured win. Correctness-positive.
+1. **[#7768](https://github.com/ether/etherpad/pull/7768)** — per-socket fan-out serialization + NEW_CHANGES_BATCH. Modest median p95 improvement at step 200 (37 → 35) but **measurably tighter envelope** (baseline max 51 → PR max 38) — fewer tail-latency excursions. Correctness-positive: prevents overlapping per-socket fan-outs that were previously racy under concurrent commits. NEW_CHANGES_BATCH framing is dormant at steady-state and fires under server slowness.
 2. **[#7762](https://github.com/ether/etherpad/pull/7762)** — Prometheus metrics. Already merged; instrument for any further dive.
-3. **[#7769](https://github.com/ether/etherpad/pull/7769)** — `historicalAuthorData` cache. Production thundering-herd fix, neutral on dive.
 
 **Do not merge:**
 
-- WebSocket-only transport (lever 4).
-- `--max-old-space-size` heap bump (lever 2).
+- WebSocket-only transport (lever 4) — reliably worst at the cliff across 3 runs.
+- `--max-old-space-size` heap bump (lever 2) — no effect.
 - The closed `fanoutDebounceMs` ([#7766](https://github.com/ether/etherpad/pull/7766)) — superseded by lever 3.
 - The closed rebase-loop prefetch ([#7770](https://github.com/ether/etherpad/pull/7770)) — didn't help.
+- The closed `historicalAuthorData` cache ([#7769](https://github.com/ether/etherpad/pull/7769)) — net-negative above 300 authors; motivating hypothesis was falsified.
+- The closed engine.io WS packing ([#7772](https://github.com/ether/etherpad/pull/7772)) — patch never fired because engine.io's flush drains too eagerly.
 
 ## Where to take this next
 
-The dive's cliff at 350-400 authors is **steady-state CPU saturation on a 4-vCPU runner with O(N²) fan-out**. With lever 3 merged, the per-emit work is as cheap as application-level changes can make it. Further ceiling extension needs to attack one of two surfaces:
+The dive's cliff at 350-400 authors is **steady-state CPU saturation on a 4-vCPU runner with O(N²) fan-out**. With lever 3 merged, the per-emit application-level work is as cheap as it can get. Further ceiling extension needs to attack one of three surfaces:
+
+1. **Engine.io flush deferral.** The closed lever-8 attempt patched only the `send(packets[])` path; what's needed is to defer `socket.flush()` itself so multiple `sendPacket()` calls in the same task accumulate before drain. `queueMicrotask`-coalesced flush is the smallest behaviour change with the right shape. This is the natural sequel to [#7767](https://github.com/ether/etherpad/issues/7767).
 
-1. **Transport-level packing.** From the [#7767](https://github.com/ether/etherpad/issues/7767) investigation: engine.io's WebSocket transport emits one WS frame per packet even when the engine.io socket has multiple packets queued. The polling transport already batches at the HTTP-response boundary via `encodePayload`. Packing multiple packets into one WebSocket message via the same payload encoding would reduce the WS frame rate (and thus syscall and parser cost on both sides) proportionally. This is an engine.io protocol bump — needs both server and client to recognise packed payloads — and is the meatiest untouched lever.
+2. **Bigger hardware or per-pad sharding.** A 4-vCPU runner is the constraint, not Etherpad. Production deployments on 8+ vCPU hosts would be expected to see the cliff move proportionally with no code changes, though this dive only measured the 4-vCPU runner shape. Per-pad multi-worker sharding lets a single host scale beyond single-core limits but is a much larger architectural change.
 
-2. **Bigger hardware or per-pad sharding.** A 4-vCPU runner is the constraint, not Etherpad. Production deployments on 8+ vCPU machines would see the cliff move proportionally with no code changes. Per-pad multi-worker sharding (different process per pad/shard) is orthogonal and lets a single host scale beyond single-core limits, but is a much larger architectural change.
+3. **Better measurement methodology.** Single-run lever comparisons sit inside the noise envelope below the cliff. Future dive scoring should default to N≥3 trials and report min/median/max. The triple-run pattern this doc adopted (see "Methodology caveat" above) is the template.
 
-Direction (1) is the next concrete investigation. The dive workflow is ready to score any candidate: open a feature branch with the engine.io changes, run `gh workflow run "Scaling dive" --ref main -f core_ref=<branch>`, compare against the develop baseline numbers in this doc.
+Direction (1) is the next concrete code investigation; (3) is methodology hygiene for all future investigations.
 
 ## Reproducing
 

From 92d40eca4012e5c0c4da27a7da0b3044c2d46892 Mon Sep 17 00:00:00 2001
From: John McLear <john@mclear.co.uk>
Date: Sat, 16 May 2026 07:35:59 +0100
Subject: [PATCH 06/15] docs(scaling-dive): N=3 re-eval of lever 3 + add lever
 8b (flush defer)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two findings from rigorous N=3 scoring:

1. Lever 3 (#7768) is NOT a perf win. When you compare like-for-
   like matrix entries (develop-baseline vs PR-baseline), the
   per-socket serialization is slightly net-negative across the
   curve. My earlier "70% drop" was a single-run outlier; the
   subsequent "tighter envelope" was a cross-matrix-entry
   comparison confounded by noise. The serialization is still a
   real correctness fix (race on concurrent fan-outs + lost
   revisions on emit error) so the PR stays open, but the
   recommendation is now correctness-only.

2. Lever 8b (#7774) — engine.io flush deferral. The follow-up to
   the closed lever 8 that actually patches Socket.sendPacket
   instead of just transport.send. queueMicrotask-coalesced flush
   gives the transport multi-packet batches to work with at last.
   N=3 shows tighter tail at step 300-350 (122 → 110 max at 350,
   71 → 58 max at 300). Not a cliff-mover. The only PR in this
   program with N=3-confirmed perf benefit.

Final disposition:
- Merge: #7774 (modest perf), #7768 (correctness), #7762 (already
  merged, instruments).
- The cliff at 350-400 authors is hardware-bound on a 4-vCPU
  runner, not code-bound. Production with more cores per host
  scales proportionally with no code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/scaling-dive-2026-05.md | 53 ++++++++++++++++++++++++++++++------
 1 file changed, 44 insertions(+), 9 deletions(-)

diff --git a/docs/scaling-dive-2026-05.md b/docs/scaling-dive-2026-05.md
index 4f80631e373..72c2b38b35e 100644
--- a/docs/scaling-dive-2026-05.md
+++ b/docs/scaling-dive-2026-05.md
@@ -197,16 +197,53 @@ What this triple-run shows:
 - **`new-changes-batch` shows the tightest envelope at step 200.** 32/35/38 vs baseline 30/37/51. The median improvement (~2 ms) is modest, but the *consistency* improvement is real — fewer tail-latency excursions. Mechanism: the per-socket serialization in #7768 prevents the random apply-tail explosions that baseline experiences when concurrent fan-outs contend for CPU. **Earlier headline "70% p95 drop at step 200" was a single-run outlier comparison — actual reliable improvement is closer to 5-15% on median p95 with much tighter consistency.**
 - **`new-changes-batch` shows a 607 ms outlier at step 350.** Worth a second look but doesn't repeat across runs — likely a flake.
 
-The lever-3 (#7768) finding still stands but **for a different reason than originally claimed**: not a dramatic p95 reduction, but improved consistency + the correctness benefit of preventing overlapping fan-outs on the same socket. The per-socket serialization is a real correctness fix; the NEW_CHANGES_BATCH framing is currently latent (it would fire under server slowness).
+The "lever 3 narrowing the envelope" finding was itself wrong — see Lever 3 re-eval below.
 
 **Going forward, lever scoring should default to N ≥ 3 trials and report min/median/max, not single-run point estimates.**
 
+### Lever 3 re-evaluation (N=3, same matrix entry)
+
+Triple-running #7768 against develop *with matching matrix entry* (not cross-matrix-entry, which was the earlier mistake) — the per-socket serialization runs on every matrix entry, so develop-baseline vs PR-baseline is the true apples-to-apples comparison:
+
+| Step | develop baseline | PR #7768 baseline |
+|---:|---|---|
+| 100 | 28/38/38 | 39/40/47 |
+| 200 | 30/37/51 | 37/50/59 |
+| 300 | 38/45/71 | 40/77/119 |
+| 350 | 39/39/122 | 63/109/131 |
+| 400 | 1758/2275/2463 | 1350/2373/3065 |
+
+**The serialization is slightly NET-NEGATIVE across the curve, not a win.** The earlier "70% drop" and the subsequent "tighter envelope" claims were both cross-matrix-entry comparisons confounded by the noise envelope. The genuinely like-for-like comparison shows no perf improvement.
+
+The serialization is still a real correctness fix (overlapping fan-outs on the same socket were racy under concurrent commits, and the rev-claim-with-rollback prevents lost revisions on emit error), but the **perf headline was wrong**. #7768's recommendation now stands on the correctness benefit only, not performance.
+
+### Lever 8b — engine.io socket flush deferral (open as [#7774](https://github.com/ether/etherpad/pull/7774))
+
+Real follow-up to the closed lever 8. Instead of patching `transport.send(packets[])`, patch `Socket.prototype.sendPacket` to schedule a coalesced flush via `queueMicrotask`. Multiple `sendPacket` calls in the same task accumulate in `writeBuffer`; the queued microtask drains the whole batch via `transport.send`. The transport then sees N > 1 packets and the engine.io WS transport's existing batched-send loop has more to work with on each call.
+
+**Modest but real signal.** N=3 develop baseline vs flush-defer (setting on):
+
+| Step | develop baseline | flush-defer |
+|---:|---|---|
+| 100 | 28/38/38 | 37/37/37 |
+| 200 | 30/37/51 | 21/44/49 |
+| **300** | **38/45/71** | **50/53/58** (tighter max: 71 → 58) |
+| **350** | **39/39/122** | **61/84/110** (tighter max: 122 → 110) |
+| 400 | 1758/2275/2463 | 1501/2157/2887 |
+
+Not a cliff-mover. **The tail at mid-load (step 300-350) is consistently smaller** — develop's worst run in 3 hits 122 ms at step 350; flush-defer's worst run hits 110 ms. At step 300, develop max 71 → flush-defer max 58. Median doesn't move dramatically but the variance does.
+
+Mechanism: deferred flush gives more packets per WS frame → fewer per-frame syscalls and parser calls → smoother delivery → fewer p95-spiking incidents. **Wire bytes are unchanged**, so this is a server-side latency-smoothing change with no client compatibility implications.
+
+**Verdict: modest mid-load win, recommend merging.** Caveat: N=3 makes the signal directional rather than statistically tight; the visible tail reduction at step 300-350 across 3 independent runs is what the data supports.
+
 ## Recommendation
 
 **Merge in priority order:**
 
-1. **[#7768](https://github.com/ether/etherpad/pull/7768)** — per-socket fan-out serialization + NEW_CHANGES_BATCH. Modest median p95 improvement at step 200 (37 → 35) but **measurably tighter envelope** (baseline max 51 → PR max 38) — fewer tail-latency excursions. Correctness-positive: prevents overlapping per-socket fan-outs that were previously racy under concurrent commits. NEW_CHANGES_BATCH framing is dormant at steady-state and fires under server slowness.
-2. **[#7762](https://github.com/ether/etherpad/pull/7762)** — Prometheus metrics. Already merged; instrument for any further dive.
+1. **[#7774](https://github.com/ether/etherpad/pull/7774)** — engine.io socket flush deferral. The only PR in this program with N=3-confirmed measurable perf improvement (tighter tail at step 300-350). Wire-compatible, server-side only.
+2. **[#7768](https://github.com/ether/etherpad/pull/7768)** — per-socket fan-out serialization + NEW_CHANGES_BATCH. No measurable perf benefit in N=3 testing — recommend merging for the **correctness fix** (the original code was racy under concurrent commits and could lose revisions on emit error). NEW_CHANGES_BATCH framing is dormant at steady-state and fires under server slowness as forward-compat groundwork.
+3. **[#7762](https://github.com/ether/etherpad/pull/7762)** — Prometheus metrics. Already merged; instrument for any further dive.
 
 **Do not merge:**
 
@@ -219,15 +256,13 @@ The lever-3 (#7768) finding still stands but **for a different reason than origi
 
 ## Where to take this next
 
-The dive's cliff at 350-400 authors is **steady-state CPU saturation on a 4-vCPU runner with O(N²) fan-out**. With lever 3 merged, the per-emit application-level work is as cheap as it can get. Further ceiling extension needs to attack one of three surfaces:
-
-1. **Engine.io flush deferral.** The closed lever-8 attempt patched only the `send(packets[])` path; what's needed is to defer `socket.flush()` itself so multiple `sendPacket()` calls in the same task accumulate before drain. `queueMicrotask`-coalesced flush is the smallest behaviour change with the right shape. This is the natural sequel to [#7767](https://github.com/ether/etherpad/issues/7767).
+The dive's cliff at 350-400 authors is **steady-state CPU saturation on a 4-vCPU runner with O(N²) fan-out**. With #7774 (flush deferral) we have a modest tail-latency improvement; with #7768 we have a correctness fix that costs nothing. Further ceiling extension needs to attack one of two remaining surfaces:
 
-2. **Bigger hardware or per-pad sharding.** A 4-vCPU runner is the constraint, not Etherpad. Production on 8+ vCPU sees the cliff move proportionally with no code changes. Per-pad multi-worker sharding lets a single host scale beyond single-core limits but is a much larger architectural change.
+1. **Bigger hardware or per-pad sharding.** A 4-vCPU runner is the constraint, not Etherpad. Production on 8+ vCPU sees the cliff move proportionally with no code changes. Per-pad multi-worker sharding lets a single host scale beyond single-core limits but is a much larger architectural change.
 
-3. **Better measurement methodology.** Single-run lever comparisons sit inside the noise envelope below the cliff. Future dive scoring should default to N≥3 trials and report min/median/max. The triple-run pattern this doc adopted (see "Methodology caveat" above) is the template.
+2. **Better measurement methodology.** Single-run lever comparisons sit inside the noise envelope below the cliff. Future dive scoring should default to N≥3 trials and report min/median/max. The triple-run pattern this doc adopted is the template; N=5+ would tighten conclusions further.
 
-Direction (1) is the next concrete code investigation; (3) is methodology hygiene for all future investigations.
+The application-level surface has been explored end-to-end. Each non-trivial code lever that was thought to be a win turned out to be either inside the noise envelope (#7766 closed, #7770 closed, #7768 perf claim wrong) or net-negative (#7769 closed). The only application-level change with confirmed perf benefit is the engine.io flush deferral (#7774) — and it's a small one. **The cliff is hardware-bound, not code-bound, on the runner we measure on.** Production deployments with more cores per host will see proportionally higher ceilings without code changes.
 
 ## Reproducing
 

From 6e16b21be22f9af2076fa813b0ea6b8eb1a41f36 Mon Sep 17 00:00:00 2001
From: John McLear <john@mclear.co.uk>
Date: Sat, 16 May 2026 08:50:19 +0100
Subject: [PATCH 07/15] docs(scaling-dive): add lever 9 (SessionManager throw
 fix #7775)

CPU profile of develop at the 100-400 author dive sweep (load-test
run 25956384097) identified a ~6% process-CPU win in SessionManager:
throw-as-control-flow on every CLIENT_READY session lookup.

Add lever 9 section with the profile evidence, link the open PR
(#7775), and add a "Other CPU hotspots surfaced" subsection
documenting findings not yet acted on (Changeset internals,
appendRevision, ueberdb/dirty backing as test-harness artifact,
esbuild __name overhead). Update Recommendation to include #7775
as the highest-priority merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/scaling-dive-2026-05.md | 33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/docs/scaling-dive-2026-05.md b/docs/scaling-dive-2026-05.md
index 72c2b38b35e..43cbc580fd5 100644
--- a/docs/scaling-dive-2026-05.md
+++ b/docs/scaling-dive-2026-05.md
@@ -217,6 +217,28 @@ Triple-running #7768 against develop *with matching matrix entry* (not cross-mat
 
 The serialization is still a real correctness fix (overlapping fan-outs on the same socket were racy under concurrent commits, and the rev-claim-with-rollback prevents lost revisions on emit error), but the **perf headline was wrong**. #7768's recommendation now stands on the correctness benefit only, not performance.
 
+### Lever 9 — SessionManager throw-as-control-flow (open as [#7775](https://github.com/ether/etherpad/pull/7775))
+
+**Hotspot identified via direct-Node CPU profile** of develop at the 100→400 author dive sweep (etherpad-load-test workflow [run 25956384097](https://github.com/ether/etherpad-load-test/actions/runs/25956384097), profile capture pipeline in load-test #109/#110/#111). The captured `.cpuprofile` shows two adjacent hotspots that share one root cause:
+
+- **1.82% self** in `new CustomError('sessionID does not exist', 'apierror')` (V8 stack-trace capture)
+- **4.12% inverted** in `Logger.<computed>` whose first non-log4js caller is `SecurityManager.checkAccess`
+
+The chain is `checkAccess → SessionManager.findAuthorID → getSessionInfo throws CustomError → catch → console.debug → log4js`. Every CLIENT_READY with a session cookie that doesn't resolve to a stored session executes this whole cascade. The cookie-less harness path is short-circuited at `findAuthorID` line 40, so the cost only fires when sessions are looked up — but in the dive sweep the harness drives that lookup on every message.
+
+**Fix (#7775):** add a non-throwing private `getSessionInfoOrNull` helper, route the two internal callers (`findAuthorID`, `listSessionsWithDBKey`) at it, and keep `exports.getSessionInfo` as a thin wrapper that preserves the throw for HTTP API compatibility (the API translates the thrown `apierror` to `code: 1`). All 32 cases in `tests/backend/specs/api/sessionsAndGroups.ts` pass, including "getSessionInfo of deleted session" which still expects `code: 1`.
+
+**Expected impact:** ~6% of total process CPU at the cliff. Score pending a dive sweep against the merged branch.
+
+### Other CPU hotspots surfaced (not yet acted on)
+
+The same profile also flagged:
+
+- **~25% in Changeset.ts internals** (`SmartOpAssembler`, `MergingOpAssembler`, `OpAssembler`, `StringIterator` — split across many anonymous slots). This is OT diff/merge core; not trivially optimizable without a rewrite.
+- **~13% in `Pad.appendRevision`** — dominated by `applyToAText` plus two parallel DB writes per revision (`pad:id:revs:N` and `pad:id`). Unavoidable correctness path.
+- **~13% in ueberdb `_setLocked` / `_write` / `evictOld` plus dirty-ts `_flush` / `writev`.** Most of this is *test-harness artifact* — the dive runs against the default `dirty.db` file-backed store. Production deployments with Postgres/SQLite see a different CPU profile here. Documenting so future readers don't chase this as a code lever.
+- **~4% attributable to `__name(fn, "...")` wrappers** (esbuild/tsx name-preservation helpers). May be reducible by shipping pre-built JS for production rather than transpiling at runtime via `tsx/cjs`; out of scope for this dive.
+
 ### Lever 8b — engine.io socket flush deferral (open as [#7774](https://github.com/ether/etherpad/pull/7774))
 
 Real follow-up to the closed lever 8. Instead of patching `transport.send(packets[])`, patch `Socket.prototype.sendPacket` to schedule a coalesced flush via `queueMicrotask`. Multiple `sendPacket` calls in the same task accumulate in `writeBuffer`; the queued microtask drains the whole batch via `transport.send`. The transport then sees N > 1 packets and the engine.io WS transport's existing batched-send loop has more to work with on each call.
@@ -241,9 +263,10 @@ Mechanism: deferred flush gives more packets per WS frame → fewer per-frame sy
 
 **Merge in priority order:**
 
-1. **[#7774](https://github.com/ether/etherpad/pull/7774)** — engine.io socket flush deferral. The only PR in this program with N=3-confirmed measurable perf improvement (tighter tail at step 300-350). Wire-compatible, server-side only.
-2. **[#7768](https://github.com/ether/etherpad/pull/7768)** — per-socket fan-out serialization + NEW_CHANGES_BATCH. No measurable perf benefit in N=3 testing — recommend merging for the **correctness fix** (the original code was racy under concurrent commits and could lose revisions on emit error). NEW_CHANGES_BATCH framing is dormant at steady-state and fires under server slowness as forward-compat groundwork.
-3. **[#7762](https://github.com/ether/etherpad/pull/7762)** — Prometheus metrics. Already merged; instrument for any further dive.
+1. **[#7775](https://github.com/ether/etherpad/pull/7775)** — SessionManager throw-as-control-flow fix. CPU-profile-identified ~6% process CPU win at the cliff. No public-API behavior change; passes existing API test suite. Mechanical and low-risk.
+2. **[#7774](https://github.com/ether/etherpad/pull/7774)** — engine.io socket flush deferral. The only PR in this program with N=3-confirmed measurable perf improvement at the time it was opened (tighter tail at step 300-350). Wire-compatible, server-side only.
+3. **[#7768](https://github.com/ether/etherpad/pull/7768)** — per-socket fan-out serialization + NEW_CHANGES_BATCH. No measurable perf benefit in N=3 testing — recommend merging for the **correctness fix** (the original code was racy under concurrent commits and could lose revisions on emit error). NEW_CHANGES_BATCH framing is dormant at steady-state and fires under server slowness as forward-compat groundwork.
+4. **[#7762](https://github.com/ether/etherpad/pull/7762)** — Prometheus metrics. Already merged; instrument for any further dive.
 
 **Do not merge:**
 
@@ -256,13 +279,13 @@ Mechanism: deferred flush gives more packets per WS frame → fewer per-frame sy
 
 ## Where to take this next
 
-The dive's cliff at 350-400 authors is **steady-state CPU saturation on a 4-vCPU runner with O(N²) fan-out**. With #7774 (flush deferral) we have a modest tail-latency improvement; with #7768 we have a correctness fix that costs nothing. Further ceiling extension needs to attack one of two remaining surfaces:
+The dive's cliff at 350-400 authors is **steady-state CPU saturation on a 4-vCPU runner with O(N²) fan-out**. With #7775 (session throw fix) we expect a ~6% process-CPU reduction at the cliff; with #7774 (flush deferral) a modest tail-latency improvement; with #7768 a correctness fix that costs nothing. Further ceiling extension needs to attack one of two remaining surfaces:
 
 1. **Bigger hardware or per-pad sharding.** A 4-vCPU runner is the constraint, not Etherpad. Production on 8+ vCPU sees the cliff move proportionally with no code changes. Per-pad multi-worker sharding lets a single host scale beyond single-core limits but is a much larger architectural change.
 
 2. **Better measurement methodology.** Single-run lever comparisons sit inside the noise envelope below the cliff. Future dive scoring should default to N≥3 trials and report min/median/max. The triple-run pattern this doc adopted is the template; N=5+ would tighten conclusions further.
 
-The application-level surface has been explored end-to-end. Each non-trivial code lever that was thought to be a win turned out to be either inside the noise envelope (#7766 closed, #7770 closed, #7768 perf claim wrong) or net-negative (#7769 closed). The only application-level change with confirmed perf benefit is the engine.io flush deferral (#7774) — and it's a small one. **The cliff is hardware-bound, not code-bound, on the runner we measure on.** Production deployments with more cores per host will see proportionally higher ceilings without code changes.
+The application-level surface has been explored end-to-end. Most non-trivial code levers that were thought to be wins turned out to be either inside the noise envelope (#7766 closed, #7770 closed, #7768 perf claim wrong) or net-negative (#7769 closed). The CPU-profile-identified levers are the exception: #7774 (engine.io flush deferral) is a small confirmed win, and #7775 (SessionManager throw fix) is a clear ~6% win pending a sweep against the merged branch. **The cliff remains hardware-bound on the runner we measure on**, but production deployments will see two stacking wins from #7774 + #7775 without architectural change. Further code wins would need a Changeset/OT refactor (~25% of profile) — a much larger project.
 
 ## Reproducing
 

From eff3a01ab15cde4f9457ab0312283a621d7e8cd1 Mon Sep 17 00:00:00 2001
From: John McLear <john@mclear.co.uk>
Date: Sat, 16 May 2026 09:34:35 +0100
Subject: [PATCH 08/15] docs(scaling-dive): add N=3 measured numbers for lever
 9 (#7775)

Replace the "score pending" placeholder under lever 9 with the
actual numbers from runs 25957107195/25957108328/25957109418
(perf branch) vs 25954537767/25954538807/25954540108 (develop),
both at authors=100..500:step=50:dwell=8s:warmup=2s.

Result: consistent -1.4% to -5.3% CPU reduction across all 9 steps,
matching profile direction at 2-5% (vs 6% profile-attributed upper
bound). Latency delta sits inside the noise envelope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/scaling-dive-2026-05.md | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/docs/scaling-dive-2026-05.md b/docs/scaling-dive-2026-05.md
index 43cbc580fd5..4e3f05f9be8 100644
--- a/docs/scaling-dive-2026-05.md
+++ b/docs/scaling-dive-2026-05.md
@@ -228,7 +228,19 @@ The chain is `checkAccess → SessionManager.findAuthorID → getSessionInfo thr
 
 **Fix (#7775):** add a non-throwing private `getSessionInfoOrNull` helper, route the two internal callers (`findAuthorID`, `listSessionsWithDBKey`) at it, and keep `exports.getSessionInfo` as a thin wrapper that preserves the throw for HTTP API compatibility (the API translates the thrown `apierror` to `code: 1`). All 32 cases in `tests/backend/specs/api/sessionsAndGroups.ts` pass, including "getSessionInfo of deleted session" which still expects `code: 1`.
 
-**Expected impact:** ~6% of total process CPU at the cliff. Score pending a dive sweep against the merged branch.
+**Measured impact (N=3 medians, perf branch vs develop, same `authors=100..500:step=50:dwell=8s:warmup=2s` sweep, perf runs 25957107195/25957108328/25957109418 vs develop runs 25954537767/25954538807/25954540108):**
+
+| step | dev CPU% | perf CPU% | ΔCPU% | dev p95 | perf p95 |
+|---:|---:|---:|---:|---:|---:|
+| 100 | 4.76 | 4.67 | -1.7% | 38 | 38 |
+| 200 | 15.21 | 14.60 | -4.0% | 37 | 41 |
+| 300 | 30.46 | 29.68 | -2.6% | 45 | 45 |
+| 350 | 41.58 | 39.36 | **-5.3%** | 39 | 74 |
+| 400 | 56.26 | 54.23 | -3.6% | 2275 | 2089 |
+| 450 | 72.33 | 70.49 | -2.5% | 6167 | 5891 |
+| 500 | 88.38 | 87.14 | -1.4% | 11759 | 11391 |
+
+**ΔCPU% is consistently negative (-1.4% to -5.3%) across all 9 sweep steps** (7 shown above), so the direction matches the profile prediction. The realised magnitude (2-5%) is below the profile-attributed 6% upper bound because some of the log4js cost the profile attributed to the throw path was unrelated startup/info logging. Latency impact is mostly inside the noise envelope; step 350 looks regressive at the median but the raw triples (dev [39,39,122] vs perf [73,74,124]) overlap heavily with one outlier each.
 
 ### Other CPU hotspots surfaced (not yet acted on)
 
@@ -263,7 +275,7 @@ Mechanism: deferred flush gives more packets per WS frame → fewer per-frame sy
 
 **Merge in priority order:**
 
-1. **[#7775](https://github.com/ether/etherpad/pull/7775)** — SessionManager throw-as-control-flow fix. CPU-profile-identified ~6% process CPU win at the cliff. No public-API behavior change; passes existing API test suite. Mechanical and low-risk.
+1. **[#7775](https://github.com/ether/etherpad/pull/7775)** — SessionManager throw-as-control-flow fix. N=3 measured 2-5% CPU% reduction across the cliff sweep (profile-predicted 6% upper bound). No public-API behavior change; passes existing API test suite. Mechanical and low-risk.
 2. **[#7774](https://github.com/ether/etherpad/pull/7774)** — engine.io socket flush deferral. The only PR in this program with N=3-confirmed measurable perf improvement at the time it was opened (tighter tail at step 300-350). Wire-compatible, server-side only.
 3. **[#7768](https://github.com/ether/etherpad/pull/7768)** — per-socket fan-out serialization + NEW_CHANGES_BATCH. No measurable perf benefit in N=3 testing — recommend merging for the **correctness fix** (the original code was racy under concurrent commits and could lose revisions on emit error). NEW_CHANGES_BATCH framing is dormant at steady-state and fires under server slowness as forward-compat groundwork.
 4. **[#7762](https://github.com/ether/etherpad/pull/7762)** — Prometheus metrics. Already merged; instrument for any further dive.

From 1ee3e9e62b36d58f019415129f366fb03635217e Mon Sep 17 00:00:00 2001
From: John McLear <john@mclear.co.uk>
Date: Sat, 16 May 2026 12:04:22 +0100
Subject: [PATCH 09/15] docs(scaling-dive): #7775+#7776 stacked = -12% to -20%
 CPU, cliff moves

Three combined-branch runs (perf/dive-combined = #7776 cherry-picked
onto #7775 base; runs 25960003164/25960004223/25960005248) vs the
same three develop baselines: -12% to -20% CPU% across all 9 sweep
steps, with the p95 cliff effectively moving from ~400 to ~500
authors (at step 400, two of three combined runs land below the
cliff at 45ms and 112ms p95 vs develop [1758, 2275, 2463]).

Adds:
- Lever 10 section for #7776 with its own N=3 numbers (-3.6 to -8.9%
  alone).
- "Stacking" section showing super-additive interaction.
- Local vCPU experiment showing the cliff is single-event-loop-bound,
  not total-CPU-bound: 4-core and 8-core pinned SUTs hit the same
  cliff at the same step.
- Updated TL;DR, Recommendation order (merge both #7775+#7776 first),
  and "Where to take this next" with worker-thread offload as the
  smallest next architectural step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/scaling-dive-2026-05.md | 86 +++++++++++++++++++++++++++++++++---
 1 file changed, 79 insertions(+), 7 deletions(-)

diff --git a/docs/scaling-dive-2026-05.md b/docs/scaling-dive-2026-05.md
index 4e3f05f9be8..ef1db1b247c 100644
--- a/docs/scaling-dive-2026-05.md
+++ b/docs/scaling-dive-2026-05.md
@@ -24,6 +24,8 @@ Every claim links to a CI run whose `report.json` is downloadable for re-analysi
 
 The next concrete direction with leverage is **engine.io transport-level packing** — sending multiple engine.io packets in one WebSocket frame instead of one frame per packet. See "Where to take this next" below.
 
+**Update (later in the dive):** CPU profiling against the SUT under load identified two adjacent log4js entry paths. Fixing both cuts process CPU% by **12% to 20%** relative to develop: see [#7775](https://github.com/ether/etherpad/pull/7775) (SessionManager throw-as-control-flow) and [#7776](https://github.com/ether/etherpad/pull/7776) (settings.loadTest per-message warn). At step 400, two of the three combined-branch runs landed *below* the cliff entirely. **This effectively moves the cliff from ~400 to ~500 authors.** A local taskset experiment confirmed the remaining cliff is single-event-loop-bound, not total-CPU-bound: 4-core and 8-core SUTs hit the cliff at the same step. Worker-thread offload of OT (~25% of profile) is the smallest next architectural step.
+
 ## Methodology
 
 - **Harness:** [`ether/etherpad-load-test`](https://github.com/ether/etherpad-load-test) at `main`. `--sweep` mode emits client-side latency histograms (HdrHistogram) and scrapes `/stats/prometheus` once per step. Reports as `report.json`/`csv`/`md`.
@@ -251,6 +253,74 @@ The same profile also flagged:
 - **~13% in ueberdb `_setLocked` / `_write` / `evictOld` plus dirty-ts `_flush` / `writev`.** Most of this is *test-harness artifact* — the dive runs against the default `dirty.db` file-backed store. Production deployments with Postgres/SQLite see a different CPU profile here. Documenting so future readers don't chase this as a code lever.
 - **~4% attributable to `__name(fn, "...")` wrappers** (esbuild/tsx name-preservation helpers). May be reducible by shipping pre-built JS for production rather than transpiling at runtime via `tsx/cjs`; out of scope for this dive.
 
+### Lever 10 — `settings.loadTest` per-message warn (open as [#7776](https://github.com/ether/etherpad/pull/7776))
+
+While capturing the lever-9 profile against the *post-#7775* perf branch ([run 25957515210](https://github.com/ether/etherpad-load-test/actions/runs/25957515210)), the log4js cost (4% of total CPU, inverted-caller pointing at `SecurityManager.checkAccess`) was *unchanged*, which surfaced the real root cause. Lines 78-81 of `SecurityManager.ts`:
+
+```ts
+if (settings.loadTest) {
+  console.warn(
+      'bypassing socket.io authentication and authorization checks due to settings.loadTest');
+}
+```
+
+…fires on every `checkAccess` invocation — once per inbound socket.io message. `log4js.replaceConsole` routes the `console.warn` through `Logger._log → sendToListeners → sendLogEventToAppender`, paying full LogEvent allocation + dispatch on every CLIENT_READY, COMMIT_CHANGESET, etc.
+
+**Fix (#7776):** drop the per-message log (the loadTest short-circuit still applies), move the configuration warning to startup in `Settings.ts` next to the other config-time warnings. Production unaffected (`loadTest: false` by default); dive harness and any benchmark/staging setup with `loadTest: true` gets the savings.
+
+**N=3 measured impact** (runs 25959515488/25959516741/25959517823 vs the same develop baselines used elsewhere):
+
+| step | dev CPU% | #7776 CPU% | **ΔCPU%** | dev p95 | #7776 p95 |
+|---:|---:|---:|---:|---:|---:|
+| 100 | 4.76 | 4.51 | **-5.3%** | 38 | 33 |
+| 200 | 15.21 | 14.33 | -5.8% | 37 | 31 |
+| 300 | 30.46 | 28.50 | -6.4% | 45 | 46 |
+| 350 | 41.58 | 37.87 | **-8.9%** | 39 | 59\* |
+| 400 | 56.26 | 53.67 | -4.6% | 2275 | **1903** (-16%) |
+| 450 | 72.33 | 68.80 | -4.9% | 6167 | **5527** (-10%) |
+| 500 | 88.38 | 85.17 | -3.6% | 11759 | **10655** (-9%) |
+
+\*step 350 raw triples: dev [39, 39, 122] vs #7776 [37, 38, 39]. #7776's distribution is *tighter* across all 3 runs (every run lands within 37-39, with no excursion like develop's 122); the median doesn't show this.
+
+CPU% drops by 3.6% to 8.9% across all 9 sweep steps (7 shown above), with consistent direction in every N=3 raw triple. Past the cliff (400+), p95 drops 9-16%: the SUT processes the same load more quickly when the loadTest warning isn't competing for log4js dispatch.
+
+### Stacking lever 9 (#7775) and lever 10 (#7776)
+
+The two CPU-profile-identified levers attack adjacent log4js entry paths. Three combined-branch runs (`perf/dive-combined` = #7776 cherry-picked onto the #7775 base; runs 25960003164/25960004223/25960005248) vs the same three develop baselines:
+
+| step | dev CPU% | #7775 | #7776 | **both** | Δ#7775 | Δ#7776 | **Δboth** |
+|---:|---:|---:|---:|---:|---:|---:|---:|
+| 100 | 4.76 | 4.67 | 4.51 | 3.99 | -1.7% | -5.3% | **-16.1%** |
+| 200 | 15.21 | 14.60 | 14.33 | 12.48 | -4.0% | -5.8% | **-17.9%** |
+| 300 | 30.46 | 29.68 | 28.50 | 24.39 | -2.6% | -6.4% | **-19.9%** |
+| 350 | 41.58 | 39.36 | 37.87 | 33.04 | -5.3% | -8.9% | **-20.5%** |
+| 400 | 56.26 | 54.23 | 53.67 | 44.78 | -3.6% | -4.6% | **-20.4%** |
+| 450 | 72.33 | 70.49 | 68.80 | 61.18 | -2.5% | -4.9% | **-15.4%** |
+| 500 | 88.38 | 87.14 | 85.17 | 77.70 | -1.4% | -3.6% | **-12.1%** |
+
+The stacked impact (-12% to -20% CPU%) is **super-additive**: well above the simple sum of the two individual gains (at step 300, -19.9% combined vs -2.6% + -6.4% = -9.0% summed). Both fixes remove call sites that funnel into the same log4js cluster-mode dispatch chain (`sendToListeners → sendLogEventToAppender`); halving the LogEvent allocation rate appears to relieve queue / GC pressure beyond what either fix accounts for in isolation.
+
+**Latency impact** (p95, raw triples shown to expose the cliff-shift):
+
+| step | develop p95 [3 runs] | combined p95 [3 runs] |
+|---:|---|---|
+| 400 | [1758, 2275, 2463] | **[45, 112, 634]** |
+| 450 | [5415, 6167, 6611] | [3297, 3719, 3897] (-40%) |
+| 500 | [10655, 11759, 12183] | [8091, 8711, 9127] (-26%) |
+
+At step 400, **two of three combined runs land below the cliff entirely** (45ms, 112ms), so the cliff has effectively moved from ~400 to ~500 authors. At step 500 the cliff is still there but the median p95 is 26% lower. This is the largest measured single-direction perf improvement in the dive.
+
+### Local vCPU-scaling experiment
+
+To answer "is the cliff CPU-bound or event-loop-bound", I ran the same dive sweep locally against a develop SUT pinned via `taskset -c` to varying core counts (Ryzen 5 3600, 12 threads; harness on disjoint cores to avoid contention):
+
+| SUT cores | Cliff (p95 spike) | CPU% @ step 500 |
+|---:|---:|---:|
+| 4 (pinned 0-3) | ~350 | 97.6% |
+| 8 (pinned 0-7) | ~350 | 96.4% |
+
+Doubling cores produced no improvement. The 96-98% CPU% reading is `process.cpuUsage()` against a single Node thread, which maxes out at one full core. **The cliff is single-event-loop-bound, not total-CPU-bound.** Adding cores via cluster-mode or bigger boxes does not move the cliff for a single Etherpad process. The application-layer levers (this dive) are the only way forward at fixed process count, and worker-thread offload of OT (~25% of profile spent in `Changeset.ts` internals) is the next architectural step worth a separate program of work.
+
 ### Lever 8b — engine.io socket flush deferral (open as [#7774](https://github.com/ether/etherpad/pull/7774))
 
 Real follow-up to the closed lever 8. Instead of patching `transport.send(packets[])`, patch `Socket.prototype.sendPacket` to schedule a coalesced flush via `queueMicrotask`. Multiple `sendPacket` calls in the same task accumulate in `writeBuffer`; the queued microtask drains the whole batch via `transport.send`. The transport then sees N > 1 packets and the engine.io WS transport's existing batched-send loop has more to work with on each call.
@@ -275,10 +345,12 @@ Mechanism: deferred flush gives more packets per WS frame → fewer per-frame sy
 
 **Merge in priority order:**
 
-1. **[#7775](https://github.com/ether/etherpad/pull/7775)** — SessionManager throw-as-control-flow fix. N=3 measured 2-5% CPU% reduction across the cliff sweep (profile-predicted 6% upper bound). No public-API behavior change; passes existing API test suite. Mechanical and low-risk.
-2. **[#7774](https://github.com/ether/etherpad/pull/7774)** — engine.io socket flush deferral. The only PR in this program with N=3-confirmed measurable perf improvement at the time it was opened (tighter tail at step 300-350). Wire-compatible, server-side only.
-3. **[#7768](https://github.com/ether/etherpad/pull/7768)** — per-socket fan-out serialization + NEW_CHANGES_BATCH. No measurable perf benefit in N=3 testing — recommend merging for the **correctness fix** (the original code was racy under concurrent commits and could lose revisions on emit error). NEW_CHANGES_BATCH framing is dormant at steady-state and fires under server slowness as forward-compat groundwork.
-4. **[#7762](https://github.com/ether/etherpad/pull/7762)** — Prometheus metrics. Already merged; instrument for any further dive.
+0. **Merge #7775 + [#7776](https://github.com/ether/etherpad/pull/7776) together.** They attack adjacent log4js entry paths and N=3 measured combined impact is **-12% to -20% CPU% across the full cliff sweep**, with the p95 cliff effectively shifting from ~400 → ~500 authors (two of three combined runs at step 400 land below the cliff entirely). Super-additive interaction — landing only one captures < half the win.
+1. **[#7775](https://github.com/ether/etherpad/pull/7775)** — SessionManager throw-as-control-flow fix. N=3 measured 2-5% CPU% reduction alone (less when paired). No public-API behavior change; passes existing API test suite. Mechanical and low-risk.
+2. **[#7776](https://github.com/ether/etherpad/pull/7776)** — `settings.loadTest` per-message warning. N=3 measured 3.6-8.9% CPU% reduction alone. Test-harness-facing today but always-on logical cleanup. See item 0 for the recommended packaging.
+3. **[#7774](https://github.com/ether/etherpad/pull/7774)** — engine.io socket flush deferral. Tighter tail at step 300-350 (N=3). Wire-compatible, server-side only.
+4. **[#7768](https://github.com/ether/etherpad/pull/7768)** — per-socket fan-out serialization + NEW_CHANGES_BATCH. No measurable perf benefit in N=3 testing — recommend merging for the **correctness fix** (the original code was racy under concurrent commits and could lose revisions on emit error). NEW_CHANGES_BATCH framing is dormant at steady-state and fires under server slowness as forward-compat groundwork.
+5. **[#7762](https://github.com/ether/etherpad/pull/7762)** — Prometheus metrics. Already merged; instrument for any further dive.
 
 **Do not merge:**
 
@@ -291,13 +363,13 @@ Mechanism: deferred flush gives more packets per WS frame → fewer per-frame sy
 
 ## Where to take this next
 
-The dive's cliff at 350-400 authors is **steady-state CPU saturation on a 4-vCPU runner with O(N²) fan-out**. With #7775 (session throw fix) we expect a ~6% process-CPU reduction at the cliff; with #7774 (flush deferral) a modest tail-latency improvement; with #7768 a correctness fix that costs nothing. Further ceiling extension needs to attack one of two remaining surfaces:
+The dive's cliff at 350-400 authors is **single-event-loop saturation on one core, regardless of host vCPU count** (confirmed by local taskset experiment: 4-core and 8-core SUTs hit the same cliff at the same step with one full core busy). With #7775+#7776 stacked the cliff effectively moves from ~400 to ~500 authors and CPU% drops 12-20% across the whole sweep. With #7774 (flush deferral) a modest tail-latency improvement on top. With #7768 a correctness fix that costs nothing. Further ceiling extension needs to attack one of two remaining surfaces:
 
-1. **Bigger hardware or per-pad sharding.** A 4-vCPU runner is the constraint, not Etherpad. Production on 8+ vCPU sees the cliff move proportionally with no code changes. Per-pad multi-worker sharding lets a single host scale beyond single-core limits but is a much larger architectural change.
+1. **Worker-thread offload of OT.** ~25% of CPU is in `Changeset.applyToAText` and friends — pure computation that could run in a worker thread or worker pool. The main event loop becomes a coordinator; the heavy lift parallelises. Verified necessary by the local vCPU experiment above: bigger boxes do *not* move the cliff because Etherpad uses one core regardless. Worker threads is the smallest architectural change that lifts the single-event-loop ceiling.
 
 2. **Better measurement methodology.** Single-run lever comparisons sit inside the noise envelope below the cliff. Future dive scoring should default to N≥3 trials and report min/median/max. The triple-run pattern this doc adopted is the template; N=5+ would tighten conclusions further.
 
-The application-level surface has been explored end-to-end. Most non-trivial code levers that were thought to be wins turned out to be either inside the noise envelope (#7766 closed, #7770 closed, #7768 perf claim wrong) or net-negative (#7769 closed). The CPU-profile-identified levers are the exception: #7774 (engine.io flush deferral) is a small confirmed win, and #7775 (SessionManager throw fix) is a clear ~6% win pending a sweep against the merged branch. **The cliff remains hardware-bound on the runner we measure on**, but production deployments will see two stacking wins from #7774 + #7775 without architectural change. Further code wins would need a Changeset/OT refactor (~25% of profile) — a much larger project.
+The application-level surface has been explored end-to-end. Most non-trivial code levers that were thought to be wins turned out to be either inside the noise envelope (#7766 closed, #7770 closed, #7768 perf claim wrong) or net-negative (#7769 closed). The CPU-profile-identified levers are the exception: #7775 + #7776 stacked deliver -12% to -20% CPU% with the cliff effectively shifting from ~400 to ~500 authors — the biggest single-direction perf improvement in this program, and the first set of changes that move the cliff position itself rather than just thinning the tail. #7774 layers a modest additional tail-latency improvement on top. **Past this point the cliff is no longer hardware-bound; it's single-event-loop-bound** — verified by the local taskset experiment showing the cliff doesn't move when you give Etherpad more cores. Worker-thread offload of OT is the smallest architectural change that lifts the ceiling further — a separate program of work.
 
 ## Reproducing
 

From 661e829aef393e4de3b77af909f2fe67263b49fa Mon Sep 17 00:00:00 2001
From: John McLear <john@mclear.co.uk>
Date: Sat, 16 May 2026 12:12:17 +0100
Subject: [PATCH 10/15] docs(scaling-dive): scope worker-thread first cut for
 applyToText

Post-#7775/#7776 profile shows applyToAText splits cleanly:
- applyToText (Changeset.ts:404) is pure (cs, text) -> text; trivially
  offloadable to a worker via worker_threads structured-clone postMessage.
- applyToAttribution (Changeset.ts:684) mutates AttributePool; not
  trivially offloadable.

Document the obvious first-pass design (run them in parallel via
Promise.all inside applyToAText) and the realistic estimate (~6-8%
CPU moved off the main event loop). putAttrib is only 0.26% in the
post-fix profile, confirming the bulk of applyToAText's cost is in
the string-manipulation half.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/scaling-dive-2026-05.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/docs/scaling-dive-2026-05.md b/docs/scaling-dive-2026-05.md
index ef1db1b247c..522f4390fd7 100644
--- a/docs/scaling-dive-2026-05.md
+++ b/docs/scaling-dive-2026-05.md
@@ -367,6 +367,12 @@ The dive's cliff at 350-400 authors is **single-event-loop saturation on one cor
 
 1. **Worker-thread offload of OT.** ~25% of CPU is in `Changeset.applyToAText` and friends — pure computation that could run in a worker thread or worker pool. The main event loop becomes a coordinator; the heavy lift parallelises. Verified necessary by the local vCPU experiment above: bigger boxes do *not* move the cliff because Etherpad uses one core regardless. Worker threads is the smallest architectural change that lifts the single-event-loop ceiling.
 
+   **Concrete first-pass design.** `applyToAText(cs, atext, pool)` (`Changeset.ts:1060`) returns `{text: applyToText(cs, atext.text), attribs: applyToAttribution(cs, atext.attribs, pool)}`. The two halves are independent:
+   - `applyToText` (`Changeset.ts:404`) is a **pure function** of `(cs, text)`. Trivially offloadable to a worker pool via `node:worker_threads`. No shared state to negotiate; strings copy via `postMessage` structured clone.
+   - `applyToAttribution` (`Changeset.ts:684`) mutates `AttributePool` via `putAttrib`. Not trivially offloadable.
+   
+   The simplest first cut: dispatch `applyToText` to a worker while `applyToAttribution` runs on the main thread; `await Promise.all([workerText, mainAttrib])` inside `applyToAText`. The post-#7775/#7776 profile shows `putAttrib` is only 0.26% of CPU, so the bulk of the ~13% appendRevision share is in `applyToText` (string ops + `StringIterator` + `StringAssembler`). Plausible offload: ~6-8% of process CPU moved off the main event loop, directly recovering cliff headroom on a single Node process. Worth a focused PoC against one worker thread before deciding pool size.
+
 2. **Better measurement methodology.** Single-run lever comparisons sit inside the noise envelope below the cliff. Future dive scoring should default to N≥3 trials and report min/median/max. The triple-run pattern this doc adopted is the template; N=5+ would tighten conclusions further.
 
 The application-level surface has been explored end-to-end. Most non-trivial code levers that were thought to be wins turned out to be either inside the noise envelope (#7766 closed, #7770 closed, #7768 perf claim wrong) or net-negative (#7769 closed). The CPU-profile-identified levers are the exception: #7775 + #7776 stacked deliver -12% to -20% CPU% with the cliff effectively shifting from ~400 to ~500 authors — the biggest single-direction perf improvement in this program, and the first set of changes that move the cliff position itself rather than just thinning the tail. #7774 layers a modest additional tail-latency improvement on top. **Past this point the cliff is no longer hardware-bound; it's single-event-loop-bound** — verified by the local taskset experiment showing the cliff doesn't move when you give Etherpad more cores. Worker-thread offload of OT is the smallest architectural change that lifts the ceiling further — a separate program of work.

From a0df33614b79709ec796e651fb80bfb768535bb6 Mon Sep 17 00:00:00 2001
From: John McLear <john@mclear.co.uk>
Date: Sat, 16 May 2026 12:23:15 +0100
Subject: [PATCH 11/15] docs(scaling-dive): per-call worker-thread offload
 falsified
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Microbenchmark on branch experiment/worker-thread-applytotext shows
that dispatching applyToText to a worker is net-NEGATIVE at every
realistic pad size (+11% to +326% overhead). The string postMessage
serialization cost exceeds the per-call applyToText work for our
workload (typical pads are 1-10KB, calls 17-86 µs sync, dispatch
overhead 40-90 µs).

Replace the earlier "Concrete first-pass design" recommendation
(which assumed worker offload would win) with the actual numbers
and reframe the architectural next step as per-pad worker isolation
(handoff serialization paid once at pad ownership transfer rather
than per changeset).
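
For reference, the overhead column in the table added below is plain
relative arithmetic on the two timing columns. A minimal TypeScript
sketch, assuming the rounded µs/call values exactly as printed in the
table; the `Row` type and the `overheadPct` / `dispatchCostUs` helpers
are illustrative names, not part of the benchmark file:

```typescript
// Illustrative only: recomputes "worker overhead" from the rounded
// µs/call figures reported in the benchmark table.
type Row = { size: string; syncUs: number; workerUs: number };

const rows: Row[] = [
  { size: '1 KB', syncUs: 17, workerUs: 57 },
  { size: '10 KB', syncUs: 43, workerUs: 48 },
  { size: '100 KB', syncUs: 86, workerUs: 174 },
  { size: '500 KB', syncUs: 341, workerUs: 1384 },
  { size: '2 MB', syncUs: 1507, workerUs: 6419 },
];

// Relative overhead of the worker round trip versus the sync call, in %.
function overheadPct(syncUs: number, workerUs: number): number {
  return ((workerUs - syncUs) / syncUs) * 100;
}

// Absolute extra time per call added by the dispatch, in µs.
function dispatchCostUs(syncUs: number, workerUs: number): number {
  return workerUs - syncUs;
}

for (const r of rows) {
  console.log(
    `${r.size}: +${overheadPct(r.syncUs, r.workerUs).toFixed(0)}% ` +
    `(+${dispatchCostUs(r.syncUs, r.workerUs)} µs/call)`);
}
```

Recomputed from the rounded inputs, the percentages land close to the
reported column. Small differences (most visibly the 1 KB row) are
presumably rounding in the displayed timings; that is an assumption,
not something the benchmark output confirms.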

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/scaling-dive-2026-05.md | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/docs/scaling-dive-2026-05.md b/docs/scaling-dive-2026-05.md
index 522f4390fd7..8831c2262c5 100644
--- a/docs/scaling-dive-2026-05.md
+++ b/docs/scaling-dive-2026-05.md
@@ -365,17 +365,23 @@ Mechanism: deferred flush gives more packets per WS frame → fewer per-frame sy
 
 The dive's cliff at 350-400 authors is **single-event-loop saturation on one core, regardless of host vCPU count** (confirmed by local taskset experiment: 4-core and 8-core SUTs hit the same cliff at the same step with one full core busy). With #7775+#7776 stacked the cliff effectively moves from ~400 to ~500 authors and CPU% drops 12-20% across the whole sweep. With #7774 (flush deferral) a modest tail-latency improvement on top. With #7768 a correctness fix that costs nothing. Further ceiling extension needs to attack one of two remaining surfaces:
 
-1. **Worker-thread offload of OT.** ~25% of CPU is in `Changeset.applyToAText` and friends — pure computation that could run in a worker thread or worker pool. The main event loop becomes a coordinator; the heavy lift parallelises. Verified necessary by the local vCPU experiment above: bigger boxes do *not* move the cliff because Etherpad uses one core regardless. Worker threads is the smallest architectural change that lifts the single-event-loop ceiling.
+1. **Per-call worker-thread offload of `applyToText` — falsified by microbenchmark.** Initial hypothesis: `applyToText` is pure-functional (Changeset.ts:404), so dispatching it to a `node:worker_threads` worker would free the main event loop for the duration of the call. Per-call benchmark (branch `experiment/worker-thread-applytotext`, file `src/scaling-bench/applyToText-bench.ts`) on the same Ryzen 5 3600 box, Node 25.9.0:
 
-   **Concrete first-pass design.** `applyToAText(cs, atext, pool)` (`Changeset.ts:1060`) returns `{text: applyToText(cs, atext.text), attribs: applyToAttribution(cs, atext.attribs, pool)}`. The two halves are independent:
-   - `applyToText` (`Changeset.ts:404`) is a **pure function** of `(cs, text)`. Trivially offloadable to a worker pool via `node:worker_threads`. No shared state to negotiate; strings copy via `postMessage` structured clone.
-   - `applyToAttribution` (`Changeset.ts:684`) mutates `AttributePool` via `putAttrib`. Not trivially offloadable.
-   
-   The simplest first cut: dispatch `applyToText` to a worker while `applyToAttribution` runs on the main thread; `await Promise.all([workerText, mainAttrib])` inside `applyToAText`. The post-#7775/#7776 profile shows `putAttrib` is only 0.26% of CPU, so the bulk of the ~13% appendRevision share is in `applyToText` (string ops + `StringIterator` + `StringAssembler`). Plausible offload: ~6-8% of process CPU moved off the main event loop, directly recovering cliff headroom on a single Node process. Worth a focused PoC against one worker thread before deciding pool size.
+   | text size | sync (µs/call) | worker round-trip (µs/call) | worker overhead |
+   |---:|---:|---:|---:|
+   | 1 KB | 17 | 57 | **+244%** |
+   | 10 KB | 43 | 48 | +11% |
+   | 100 KB | 86 | 174 | +102% |
+   | 500 KB | 341 | 1384 | +306% |
+   | 2 MB | 1507 | 6419 | +326% |
+
+   At every realistic pad size the worker dispatch is slower than synchronous execution, *and the slowness is paid on the main thread* (structured-clone serialization of the input string + deserialization of the output string both run in the caller's isolate). The "free up the event loop" win never materialises: per-call work (17-86 µs for typical pad sizes) is smaller than per-call postMessage overhead (40-90 µs). V8 isolate boundaries do not share strings; `Transferable` and `SharedArrayBuffer` paths don't apply to string content. **Per-call offload is net-negative.**
+
+2. **Per-pad worker isolation (next architectural lever).** The right shape for parallelism in Etherpad is one level higher: each pad's lifecycle runs in its own worker thread (or process); the main thread is a thin router that hands sockets off to the pad worker and forwards outbound messages back. Serialization happens **once at handoff**, not per changeset; OT work for different pads parallelises across cores; existing `applyToText`/`applyToAttribution` stays synchronous *inside* the pad worker. The dive's "more authors per pad" question is still bounded by one event loop per pad — but the program's overall ceiling (authors-across-all-pads) scales with core count. Sizing the change correctly is a separate program of work; this dive does not scope it further.
 
 2. **Better measurement methodology.** Single-run lever comparisons sit inside the noise envelope below the cliff. Future dive scoring should default to N≥3 trials and report min/median/max. The triple-run pattern this doc adopted is the template; N=5+ would tighten conclusions further.
 
-The application-level surface has been explored end-to-end. Most non-trivial code levers that were thought to be wins turned out to be either inside the noise envelope (#7766 closed, #7770 closed, #7768 perf claim wrong) or net-negative (#7769 closed). The CPU-profile-identified levers are the exception: #7775 + #7776 stacked deliver -12% to -20% CPU% with the cliff effectively shifting from ~400 to ~500 authors — the biggest single-direction perf improvement in this program, and the first set of changes that move the cliff position itself rather than just thinning the tail. #7774 layers a modest additional tail-latency improvement on top. **Past this point the cliff is no longer hardware-bound; it's single-event-loop-bound** — verified by the local taskset experiment showing the cliff doesn't move when you give Etherpad more cores. Worker-thread offload of OT is the smallest architectural change that lifts the ceiling further — a separate program of work.
+The application-level surface has been explored end-to-end. Most non-trivial code levers that were thought to be wins turned out to be either inside the noise envelope (#7766 closed, #7770 closed, #7768 perf claim wrong) or net-negative (#7769 closed). The CPU-profile-identified levers are the exception: #7775 + #7776 stacked deliver -12% to -20% CPU% with the cliff effectively shifting from ~400 to ~500 authors — the biggest single-direction perf improvement in this program, and the first set of changes that move the cliff position itself rather than just thinning the tail. #7774 layers a modest additional tail-latency improvement on top. **Past this point the cliff is no longer hardware-bound; it's single-event-loop-bound** — verified by the local taskset experiment showing the cliff doesn't move when you give Etherpad more cores. Per-call worker-thread offload of `applyToText` was prototyped and falsified (postMessage overhead exceeds the work; see "Where to take this next" below). The remaining architectural lever for *one pad with N authors* is per-pad worker isolation; for *N pads across many cores* it's a sticky-session cluster — both substantially larger changes.
 
 ## Reproducing
 

From f20d56ec691ac7b7909631374af1ad5fcd0a3962 Mon Sep 17 00:00:00 2001
From: John McLear <john@mclear.co.uk>
Date: Sat, 16 May 2026 12:31:08 +0100
Subject: [PATCH 12/15] docs(scaling-dive): link #7780 (room-broadcast fan-out)
 as next lever after worker-thread falsification

---
 docs/scaling-dive-2026-05.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/scaling-dive-2026-05.md b/docs/scaling-dive-2026-05.md
index 8831c2262c5..a6015361979 100644
--- a/docs/scaling-dive-2026-05.md
+++ b/docs/scaling-dive-2026-05.md
@@ -379,7 +379,9 @@ The dive's cliff at 350-400 authors is **single-event-loop saturation on one cor
 
 2. **Per-pad worker isolation (next architectural lever).** The right shape for parallelism in Etherpad is one level higher: each pad's lifecycle runs in its own worker thread (or process); the main thread is a thin router that hands sockets off to the pad worker and forwards outbound messages back. Serialization happens **once at handoff**, not per changeset; OT work for different pads parallelises across cores; existing `applyToText`/`applyToAttribution` stays synchronous *inside* the pad worker. The dive's "more authors per pad" question is still bounded by one event loop per pad — but the program's overall ceiling (authors-across-all-pads) scales with core count. Sizing the change correctly is a separate program of work; this dive does not scope it further.
 
-2. **Better measurement methodology.** Single-run lever comparisons sit inside the noise envelope below the cliff. Future dive scoring should default to N≥3 trials and report min/median/max. The triple-run pattern this doc adopted is the template; N=5+ would tighten conclusions further.
+3. **Room-broadcast `updatePadClients` fan-out, filed as [#7780](https://github.com/ether/etherpad/issues/7780).** With #7775+#7776 merged, the next visible cluster in the post-fix profile is socket.io's per-recipient packet construction inside `PadMessageHandler.updatePadClients` (~10% of CPU: emit 3.36% + packet 3.56% + _packet 3.31%). The fan-out loop today does `socket.emit('message', msg)` per recipient: N packet constructions of essentially identical content (only `timeDelta` and `currentTime` differ per recipient, and both fields are timeslider-only; live `collab_client.ts` ignores them). Swapping to `io.in(padId).emit('message', msg)` collapses N encode calls into 1 via the in-memory adapter's `broadcast()` path. Realistic savings: ~5-7% CPU at the dive cliff. Implementation isn't trivial because of the catch-up case (lagging sockets silently drop messages with `newRev !== rev + 1`); see the issue for the design choice between "split steady-state from catch-up" (Shape A) vs "push catch-up to a CLIENT_REQUEST_RESEND path" (Shape B).
+
+4. **Better measurement methodology.** Single-run lever comparisons sit inside the noise envelope below the cliff. Future dive scoring should default to N≥3 trials and report min/median/max. The triple-run pattern this doc adopted is the template; N=5+ would tighten conclusions further.
 
 The application-level surface has been explored end-to-end. Most non-trivial code levers that were thought to be wins turned out to be either inside the noise envelope (#7766 closed, #7770 closed, #7768 perf claim wrong) or net-negative (#7769 closed). The CPU-profile-identified levers are the exception: #7775 + #7776 stacked deliver -12% to -20% CPU% with the cliff effectively shifting from ~400 to ~500 authors — the biggest single-direction perf improvement in this program, and the first set of changes that move the cliff position itself rather than just thinning the tail. #7774 layers a modest additional tail-latency improvement on top. **Past this point the cliff is no longer hardware-bound; it's single-event-loop-bound** — verified by the local taskset experiment showing the cliff doesn't move when you give Etherpad more cores. Per-call worker-thread offload of `applyToText` was prototyped and falsified (postMessage overhead exceeds the work; see "Where to take this next" below). The remaining architectural lever for *one pad with N authors* is per-pad worker isolation; for *N pads across many cores* it's a sticky-session cluster — both substantially larger changes.
 

From 0e8cb97c94472434213d608333da47f0e0d0f159 Mon Sep 17 00:00:00 2001
From: John McLear <john@mclear.co.uk>
Date: Sat, 16 May 2026 13:22:44 +0100
Subject: [PATCH 13/15] docs(scaling-dive): tiered roadmap for future effort
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add a "Roadmap for future effort" section ahead of Reproducing,
ranking the next concrete options by impact-per-time-spent.

Tier 1 (mechanical / <1 day each):
- merge ready perf PRs (#7775+#7776+#7774)
- implement #7780 room-broadcast fan-out
- additional post-fix profile pass

Tier 2 (medium, real cliff moves):
- selective fan-out / viewport-based broadcast (~2 weeks; cliff ~500 → 1000-1500)
- per-pad worker isolation PoC (~1-2 weeks PoC, 1-2 months prod)

Tier 3 (large bets):
- sticky-session cluster mode (~2-4 weeks PoC)
- CRDT migration (months; anti-recommended)

Tier 4 (operational):
- production telemetry hookup (~3-5 days)
- nightly dive in CI (~1 day)

Records the recommended sequence (Tier 1.2 → Tier 2.4) so the
next person picking this up doesn't need to re-derive it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/scaling-dive-2026-05.md | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/docs/scaling-dive-2026-05.md b/docs/scaling-dive-2026-05.md
index a6015361979..7ed75fb8deb 100644
--- a/docs/scaling-dive-2026-05.md
+++ b/docs/scaling-dive-2026-05.md
@@ -385,6 +385,40 @@ The dive's cliff at 350-400 authors is **single-event-loop saturation on one cor
 
 The application-level surface has been explored end-to-end. Most non-trivial code levers that were thought to be wins turned out to be either inside the noise envelope (#7766 closed, #7770 closed, #7768 perf claim wrong) or net-negative (#7769 closed). The CPU-profile-identified levers are the exception: #7775 + #7776 stacked deliver -12% to -20% CPU% with the cliff effectively shifting from ~400 to ~500 authors — the biggest single-direction perf improvement in this program, and the first set of changes that move the cliff position itself rather than just thinning the tail. #7774 layers a modest additional tail-latency improvement on top. **Past this point the cliff is no longer hardware-bound; it's single-event-loop-bound** — verified by the local taskset experiment showing the cliff doesn't move when you give Etherpad more cores. Per-call worker-thread offload of `applyToText` was prototyped and falsified (postMessage overhead exceeds the work; see "Where to take this next" below). The remaining architectural lever for *one pad with N authors* is per-pad worker isolation; for *N pads across many cores* it's a sticky-session cluster — both substantially larger changes.
 
+## Roadmap for future effort
+
+Concrete options for whoever picks this up next, ordered roughly by impact-per-time-spent. **For "more authors per pad"** the answer is Tier 1 then Tier 2 option 4; **for "more pads per box"** the answer is Tier 2 option 5 or Tier 3 option 6.
+
+### Tier 1 — small, mostly mechanical
+
+1. **Merge the 3 ready perf PRs** (#7775 + #7776 + #7774). *Cost: review + merge time only, no dev.* Locks in the −12-20% already measured by this dive. The blocker is a maintainer call, not engineering work.
+
+2. **Implement [#7780](https://github.com/ether/etherpad/issues/7780)** (room-broadcast fan-out in `updatePadClients`). Shape A from the issue: split steady-state from catch-up. *Cost: ~1 day code + N=3 dive verification.* Predicted **+5-7% CPU headroom**; cliff likely moves from ~500 → ~550 authors.
+
+3. **One more pass through the post-fix profile** looking for the same shape of bug as #7776 (per-message work that shouldn't be per-message). *Cost: ~half a day.* Diminishing returns — maybe 1-2 small wins at 1-3% each. Cheap to look, easy to abandon.
+
+### Tier 2 — medium projects, real cliff moves
+
+4. **Selective fan-out / viewport-based broadcast.** Don't send every edit to every author; full edits to ~20 authors near each cursor, digests every 1-2s to the rest. Requires viewport tracking per socket and a "digest" message type. *Cost: ~2 weeks for a feature-flagged version + dive verification.* Plausible: cliff moves from ~500 → 1000-1500 authors. **Biggest single user-visible win that doesn't change the architecture.**
+
+5. **Per-pad worker isolation PoC.** Each pad's lifecycle runs in one worker thread; the main thread is a router. Serialization paid once at pad handoff, not per changeset. *Cost: ~1-2 weeks PoC, 1-2 months production-ready.* Does **not** move the per-pad cliff (still one event loop per pad) — wins on program-wide scaling (many pads × cores). Necessary precursor for Tier 3 option 6.
+
+### Tier 3 — large bets, mostly to know we have them
+
+6. **Sticky-session cluster mode.** Multi-process, pads partitioned across workers. *Cost: ~2-4 weeks PoC.* Same scaling shape as option 5 but coarser-grained and works without restructuring the in-process code. Doesn't help "one pad with N authors" either.
+
+7. **CRDT migration (Yjs / Automerge).** Native peer-to-peer scaling without a central coordinator. *Cost: months.* **Breaks every plugin** in the ecosystem and re-litigates the editor protocol. *Anti-recommended* unless options 1-6 fail to deliver and there's a hard product requirement for thousands of authors per pad.
+
+### Tier 4 — operational, not a code lever but valuable
+
+8. **Production telemetry instrumentation.** Wire the `scalingDiveMetrics` Prometheus surface (added by #7762) into a real dashboard against a live deployment. *Cost: ~3-5 days.* Tells us whether dive numbers (GitHub runner, dirty.db backing) match production reality (real boxes, Postgres). Important before committing to Tier 2.
+
+9. **Nightly dive in CI.** N=3 sweep against `develop` once a day, flagging regressions vs the previous week's median. *Cost: ~1 day.* Catches future regressions early. Out of scope for this dive (see below) but cheap to add now that the harness is stable.
+
+### Recommended next move
+
+**Option 2 (implement #7780).** It's the only Tier 1 item that needs code; it's bounded; it has a clear measurement plan from the issue; and it moves the cliff a measurable extra ~10%. After that lands, **Tier 2 option 4 (selective fan-out)** is the biggest user-visible win for 1000+ authors per pad and is the natural next program of work.
+
 ## Reproducing
 
 ```

From f966e62da8f9a93f36311c1bd698917a1f10a2b3 Mon Sep 17 00:00:00 2001
From: John McLear <john@mclear.co.uk>
Date: Sat, 16 May 2026 13:24:51 +0100
Subject: [PATCH 14/15] docs(scaling-dive): clarify Tier 3 ecosystem impact
 (clients + plugins)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CRDT migration's blast radius is wider than "plugins" — the changeset
wire format is the lingua franca for the Electron/Capacitor desktop
app, the mobile app, etherpad-cli, MCP servers, and server plugins.
A CRDT switch means parallel reimplementation in every consumer, not
just core. Upgrade the wording from "anti-recommended" to "strongly
anti-recommended" with the actual list of affected projects.

Cluster mode in contrast is mostly transparent to wire-protocol
consumers; the only deployment shape that needs care is in-process
embeds (Electron/Capacitor bundle a single Node process for a single
user — they can skip cluster mode entirely).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/scaling-dive-2026-05.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/scaling-dive-2026-05.md b/docs/scaling-dive-2026-05.md
index 7ed75fb8deb..403b34636ff 100644
--- a/docs/scaling-dive-2026-05.md
+++ b/docs/scaling-dive-2026-05.md
@@ -407,7 +407,9 @@ Concrete options for whoever picks this up next, ordered roughly by impact-per-t
 
 6. **Sticky-session cluster mode.** Multi-process, pads partitioned across workers. *Cost: ~2-4 weeks PoC.* Same scaling shape as option 5 but coarser-grained and works without restructuring the in-process code. Doesn't help "one pad with N authors" either.
 
-7. **CRDT migration (Yjs / Automerge).** Native peer-to-peer scaling without a central coordinator. *Cost: months.* **Breaks every plugin** in the ecosystem and re-litigates the editor protocol. *Anti-recommended* unless options 1-6 fail to deliver and there's a hard product requirement for thousands of authors per pad.
+   *Ecosystem impact:* mostly transparent to clients — they connect to the server URL as usual; the load balancer (or in-process router) handles stickiness. **Desktop apps that embed the server in-process** (Electron / Capacitor bundles a single Node process for one user) would either skip cluster mode entirely (single-user, no concurrency need) or bundle the cluster manager too — a per-deployment decision, not a protocol break. **Mobile**, **terminal / etherpad-cli**, and **MCP** clients are all wire-protocol consumers and unaffected.
+
+7. **CRDT migration (Yjs / Automerge).** Native peer-to-peer scaling without a central coordinator. *Cost: months — but the headline cost is wire-protocol replacement, not the editor swap.* The Etherpad changeset format is the lingua franca for **everything that talks to a pad**: the web client, the **Electron / Capacitor desktop app** (embeds the web client), the **mobile app** (Phase 1 packaging merged 2026-05-11, wraps the same web client), **etherpad-cli** (printingpress.dev integration speaks changesets directly), **MCP servers** (any wrap pad ops via changeset semantics), and every server-side **plugin** that intercepts or transforms changesets. A CRDT migration replaces the changeset wire format with Yjs binary updates and requires parallel reimplementation in every one of those consumers — not a refactor, a fork. **Strongly anti-recommended** unless options 1-6 fail to deliver and there's a hard product requirement for thousands of authors per pad that justifies splitting the ecosystem.
 
 ### Tier 4 — operational, not a code lever but valuable
 

From 7e32f1e0e3bb1dfdff1bfe8fddd1b66585811d9a Mon Sep 17 00:00:00 2001
From: John McLear <john@mclear.co.uk>
Date: Sat, 16 May 2026 13:25:53 +0100
Subject: [PATCH 15/15] docs(scaling-dive): split horizontal scaling into 6a
 (proxy shard) + 6b (cluster)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

My original Tier 3 option 6 conflated two distinct shapes the operator
audience would treat separately:

- 6a (reverse-proxy pad sharding): the L7-proxy answer the operator
  ecosystem already runs in production. No core changes; cost is
  deployment. Solves "more pads across many boxes".

- 6b (Node cluster module with sticky padId routing): single host,
  multi-worker. Solves "more pads per box". Pick this or Tier 2
  option 5 (worker_threads), not both — same problem shape, different
  isolation boundary.

Both are transparent to wire-protocol clients (desktop, mobile,
terminal/CLI, MCP) — same as the original note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/scaling-dive-2026-05.md | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/docs/scaling-dive-2026-05.md b/docs/scaling-dive-2026-05.md
index 403b34636ff..8a9c5914b62 100644
--- a/docs/scaling-dive-2026-05.md
+++ b/docs/scaling-dive-2026-05.md
@@ -405,9 +405,13 @@ Concrete options for whoever picks this up next, ordered roughly by impact-per-t
 
 ### Tier 3 — large bets, mostly to know we have them
 
-6. **Sticky-session cluster mode.** Multi-process, pads partitioned across workers. *Cost: ~2-4 weeks PoC.* Same scaling shape as option 5 but coarser-grained and works without restructuring the in-process code. Doesn't help "one pad with N authors" either.
+6. **Horizontal scaling — two distinct shapes worth keeping separate:**
 
-   *Ecosystem impact:* mostly transparent to clients — they connect to the server URL as usual; the load balancer (or in-process router) handles stickiness. **Desktop apps that embed the server in-process** (Electron / Capacitor bundles a single Node process for one user) would either skip cluster mode entirely (single-user, no concurrency need) or bundle the cluster manager too — a per-deployment decision, not a protocol break. **Mobile**, **terminal / etherpad-cli**, and **MCP** clients are all wire-protocol consumers and unaffected.
+   - **6a. Reverse-proxy pad sharding (already known-working).** N independent etherpad processes / hosts behind an L7 proxy (nginx, HAProxy, Caddy) that hashes the `padId` from the URL path to a backend. Each backend is unaware of the others; a pad is owned by whichever backend its hash lands on. *Cost: deployment work, no core changes.* **Solves "more pads across many boxes"** and is already running in operator-hosted setups. Trade-offs: cross-pad operations (global search, list-all-pads, admin) need either a shared DB layer or out-of-band coordination; per-pad work needs neither, because every author hitting a given `padId` always lands on the same backend.
+
+   - **6b. Single-host cluster mode (Node `cluster` module + sticky `padId` routing).** One primary process spawns N workers on one host; the primary routes incoming WebSocket upgrades by hashing `padId` to a worker. *Cost: ~2-4 weeks PoC.* **Solves "more pads per box"**: it uses more cores on a single host and is complementary to 6a. Same scope of work as Tier 2 option 5 (per-pad `worker_threads` isolation) but at the process boundary instead of the thread boundary. `worker_threads` has cheaper message passing and can share memory via `SharedArrayBuffer`, although each thread still loads its own copy of every module; `cluster` has the simpler mental model of "each worker is an independent etherpad". Pick one; don't build both.
+
+   *Ecosystem impact (6a and 6b):* transparent to clients. They connect to the server URL as usual, and the load balancer (6a) or primary process (6b) handles stickiness. **Desktop apps** that embed the server in-process (Electron / Capacitor bundle a single Node process for one user) skip both modes, since a single-user deployment has no concurrency need. **Mobile**, **terminal / etherpad-cli**, and **MCP** clients are wire-protocol consumers and unaffected by either.
 
 7. **CRDT migration (Yjs / Automerge).** Native peer-to-peer scaling without a central coordinator. *Cost: months — but the headline cost is wire-protocol replacement, not the editor swap.* The Etherpad changeset format is the lingua franca for **everything that talks to a pad**: the web client, the **Electron / Capacitor desktop app** (embeds the web client), the **mobile app** (Phase 1 packaging merged 2026-05-11, wraps the same web client), **etherpad-cli** (printingpress.dev integration speaks changesets directly), **MCP servers** (any wrap pad ops via changeset semantics), and every server-side **plugin** that intercepts or transforms changesets. A CRDT migration replaces the changeset wire format with Yjs binary updates and requires parallel reimplementation in every one of those consumers — not a refactor, a fork. **Strongly anti-recommended** unless options 1-6 fail to deliver and there's a hard product requirement for thousands of authors per pad that justifies splitting the ecosystem.