Skip to content

A subinterpreter-forkserver spawning backend (i can't believe how fast we vibed this 😂 )#447

Draft
goodboy wants to merge 66 commits intosubint_fork_backendfrom
subint_forkserver_backend
Draft

A subinterpreter-forkserver spawning backend (i can't believe how fast we vibed this 😂 )#447
goodboy wants to merge 66 commits intosubint_fork_backendfrom
subint_forkserver_backend

Conversation

@goodboy
Copy link
Copy Markdown
Owner

@goodboy goodboy commented Apr 22, 2026

No description provided.

@goodboy goodboy force-pushed the subint_forkserver_backend branch from 418a7ca to 4425023 Compare April 23, 2026 15:44
@goodboy goodboy self-assigned this Apr 23, 2026
goodboy and others added 28 commits April 23, 2026 18:30
During the Phase A extraction of `trio_proc()` out of
`spawn._spawn` into its own submod, the
`debug.maybe_wait_for_debugger(child_in_debug=...)` call site in
the hard-reap `finally` got refactored from the original
`_runtime_vars.get('_debug_mode', ...)` (the fn parameter — the
dict that was constructed by the *parent* for the *child*'s
`SpawnSpec`) to `get_runtime_vars().get(...)` (a global getter that
returns the *parent's* live `_state`). Those are semantically
different — the first asks "is the child we just spawned in debug
mode?", the second asks "are *we* in debug mode?". Under
mixed-debug-mode trees the swap can incorrectly skip (or
unnecessarily delay) the debugger-lock wait during teardown.

Revert to the fn-parameter lookup and add an inline `NOTE` comment
calling out the distinction so it's harder to regress again.

Deats,
- `spawn/_trio.py`: `child_in_debug=get_runtime_vars().get(...)` →
  `child_in_debug=_runtime_vars.get(...)` at the
  `debug.maybe_wait_for_debugger(...)` call in the hard-reap block;
  add 4-line `NOTE` explaining the parent-vs-child distinction.
- `spawn/__init__.py`: drop trailing whitespace after the
  `'mp_forkserver'` docstring bullet.
- `ai/prompt-io/prompts/subints_spawner.md`: drop duplicated `with`
  in `"as with with subprocs"` prose (copilot grammar catch).

Review: PR #444 (Copilot)
#444 (review)

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Spawner modules: split up subactor spawning  backends
Group `.claude/` ignores per-skill instead of a
flat list: `ai.skillz` symlinks, `/open-wkt`,
`/code-review-changes`, `/pr-msg`, `/commit-msg`.
Add missing symlink entries (`yt-url-lookup` ->
`resolve-conflicts`, `inter-skill-review`). Drop
stale `Claude worktrees` section (already covered
by `.claude/wkts/`).

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New "Inspect last failures" section reads the pytest
`lastfailed` cache JSON directly — instant, no
collection overhead, and filters to `tests/`-prefixed
entries to avoid stale junk paths.

Also,
- add `jq` tool permission for `.pytest_cache/` files

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Rework section 3 from a worktree-only check into a
structured 3-step flow: detect active venv, interpret
results (Case A: active, B: none, C: worktree), then
run import + collection checks.

Deats,
- Case B prompts via `AskUserQuestion` when no venv
  is detected, offering `uv sync` or manual activate
- add `uv run` fallback section for envs where venv
  activation isn't practical
- new allowed-tools: `uv run python`, `uv run pytest`,
  `uv pip show`, `AskUserQuestion`

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Land the scaffolding for a future sub-interpreter (PEP 734
`concurrent.interpreters`) actor spawn backend per issue #379. The
spawn flow itself is not yet implemented; `subint_proc()` raises a
placeholder `NotImplementedError` pointing at the tracking issue —
this commit only wires up the registry, the py-version gate, and
the harness.

Deats,
- bump `pyproject.toml` `requires-python` to `>=3.12, <3.15` and
  list the `3.14` classifier — the new stdlib
  `concurrent.interpreters` module only ships on 3.14
- extend `SpawnMethodKey = Literal[..., 'subint']`
- `try_set_start_method('subint')` grows a new `match` arm that
  feature-detects the stdlib module and raises `RuntimeError` with
  a clear banner on py<3.14
- `_methods` registers the new `subint_proc()` via the same
  bottom-of-module late-import pattern used for `._trio` / `._mp`

Also,
- new `tractor/spawn/_subint.py` — top-level `try: from concurrent
  import interpreters` guards `_has_subints: bool`; `subint_proc()`
  signature mirrors `trio_proc`/`mp_proc` so the Phase B.2 impl can
  drop in without touching the registry
- re-add `import sys` to `_spawn.py` (needed for the py-version msg
  in the gate-error)
- `_testing.pytest.pytest_configure` wraps `try_set_start_method()`
  in a `pytest.UsageError` handler so `--spawn-backend=subint` on
  py<3.14 prints a clean banner instead of a traceback

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Since we're devving subints we require the 3.14+ stdlib API
and a couple compiled libs don't support it yet, namely:
- `cffi`, which we're only using for the `.ipc._linux` eventfd
  stuff (now factored into `hotbaud` anyway).
- `greenback`, which requires `greenlet` which doesn't seem to be
  wheeled yet
  * on nixos the sdist build was failing due to lack of `g++` which
    i don't care to figure out rn since we don't need `.devx` stuff
    immediately for this subints prototype.
  * [ ] we still need to adjust any dependent suites to skip.

Adjust `test_ringbuf` to skip on import failure.

Also project wide,
- pin us to py 3.13+ in prep for last-2-minor-version policy.
- drop `msgspec>=0.20.0`, the first release with py3.14 support.
Replace the B.1 scaffold stub w/ a working spawn
flow driving PEP 734 sub-interpreters on dedicated
OS threads.

Deats,
- use private `_interpreters` C mod (not the public
  `concurrent.interpreters` API) to get `'legacy'`
  subint config — avoids PEP 684 C-ext compat
  issues w/ `msgspec` and other deps missing the
  `Py_mod_multiple_interpreters` slot
- bootstrap subint via code-string calling new
  `_actor_child_main()` from `_child.py` (shared
  entry for both CLI and subint backends)
- drive subint lifetime on an OS thread using
  `trio.to_thread.run_sync(_interpreters.exec, ..)`
- full supervision lifecycle mirrors `trio_proc`:
  `ipc_server.wait_for_peer()` → send `SpawnSpec`
  → yield `Portal` via `task_status.started()`
- graceful shutdown awaits the subint's inner
  `trio.run()` completing; cancel path sends
  `portal.cancel_actor()` then waits for thread
  join before `_interpreters.destroy()`

Also,
- extract `_actor_child_main()` from `_child.py`
  `__main__` block as callable entry shape bc the
  subint needs it for code-string bootstrap
- add `"subint"` to the `_runtime.py` spawn-method
  check so child accepts `SpawnSpec` over IPC

Prompt-IO: ai/prompt-io/claude/20260417T124437Z_5cd6df5_prompt_io.md

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Expand the comment block above the `_interpreters`
import explaining *why* we use the private C mod
over `concurrent.interpreters`: the public API only
exposes PEP 734's `'isolated'` config which breaks
`msgspec` (missing PEP 684 slot). Add reference
links to PEP 734, PEP 684, cpython sources, and
the msgspec upstream tracker (jcrist/msgspec#563).

Also,
- update error msgs in both `_spawn.py` and
  `_subint.py` to say "3.13+" (matching the actual
  `_interpreters` availability) instead of "3.14+".
- tweak the mod docstring to reflect py3.13+
  availability via the private C module.

Review: PR #444 (copilot-pull-request-reviewer)
#444

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
`trio.to_thread.run_sync(_interpreters.exec, ...)` runs `exec()` on
a cached worker thread — and when that thread is returned to the
cache after the subint's `trio.run()` exits, CPython still keeps
the subint's tstate attached to the (now idle) worker. Result: the
teardown `_interpreters.destroy(interp_id)` in the `finally` block
can block the parent's trio loop indefinitely, waiting for a tstate
release that only happens when the worker either picks up a new job
or exits.

Manifested as intermittent mid-suite hangs under
`--spawn-backend=subint` — caught by a
`faulthandler.dump_traceback_later()` showing the main thread stuck
in `_interpreters.destroy()` at `_subint.py:293` with only an idle
trio-cache worker as the other live thread.

Deats,
- drive the subint on a plain `threading.Thread` (not
  `trio.to_thread`) so the OS thread truly exits after
  `_interpreters.exec()` returns, releasing tstate and unblocking
  destroy
- signal `subint_exited.set()` back to the parent trio loop from
  the driver thread via `trio.from_thread.run_sync(...,
  trio_token=...)` — capture the token at `subint_proc` entry
- swallow `trio.RunFinishedError` in that signal path for the case
  where parent trio has already exited (proc teardown)
- in the teardown `finally`, off-load the sync
  `driver_thread.join()` to `trio.to_thread.run_sync` (cache thread
  w/ no subint tstate → safe) so we actually wait for the driver to
  exit before `_interpreters.destroy()`

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Log the `claude-opus-4-7` session that produced
the `_subint.py` dedicated-thread fix (`26fb8206`).
Substantive bc the patch was entirely AI-generated;
raw log also preserves the CPython-internals
research informing Phase B.3 hard-kill work.

Prompt-IO: ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.md

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Unbounded `trio.CancelScope(shield=True)` at the
soft-kill and thread-join sites can wedge the parent
trio loop indefinitely when a stuck subint ignores
portal-cancel (e.g. bc the IPC channel is already
broken).

Deats,
- add `_HARD_KILL_TIMEOUT` (3s) module-level const
- wrap both shield sites with
  `trio.move_on_after()` so we abandon a stuck
  subint after the deadline
- flip driver thread to `daemon=True` so proc-exit
  also isn't blocked by a wedged subint
- pass `abandon_on_cancel=True` to
  `trio.to_thread.run_sync(driver_thread.join)`
  — load-bearing for `move_on_after` to actually
  fire
- log warnings when either timeout triggers
- improve `InterpreterError` log msg to explain
  the abandoned-thread scenario

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Bottle up the diagnostic primitives that actually cracked the
silent mid-suite hangs in the `subint` spawn-backend bringup (issue
there" session has them on the shelf instead of reinventing from
scratch.

Deats,
- `dump_on_hang(seconds, *, path)` — context manager wrapping
  `faulthandler.dump_traceback_later()`. Critical gotcha baked in:
  dumps go to a *file*, not `sys.stderr`, bc pytest's stderr
  capture silently eats the output and you can spend an hour
  convinced you're looking at the wrong thing
- `track_resource_deltas(label, *, writer)` — context manager
  logging per-block `(threading.active_count(),
  len(_interpreters.list_all()))` deltas; quickly rules out
  leak-accumulation theories when a suite progressively worsens (if
  counts don't grow, it's not a leak, look for a race on shared
  cleanup instead)
- `resource_delta_fixture(*, autouse, writer)` — factory returning
  a `pytest` fixture wrapping `track_resource_deltas` per-test; opt
  in by importing into a `conftest.py`. Kept as a factory (not a
  bare fixture) so callers own `autouse` / `writer` wiring

Also,
- export the three names from `tractor.devx`
- dep-free on py<3.13 (swallows `ImportError` for `_interpreters`)
- link back to the provenance in the module docstring (issue #379 /
  commit `26fb820`)

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
The private `_interpreters` C module ships since 3.13, but that vintage
wedges under our `threading.Thread` + multi-trio usage pattern
—> `_interpreters.exec()` silently never makes progress. 3.14 fixes it.
So gate on the presence of the public `concurrent.interpreters` wrapper
(3.14+ only) even tho we still call into the private module at runtime.

Deats,
- `try_set_start_method('subint')` error msg + `_subint` module
  docstring/comments rewritten to document the 3.14 floor and why 3.13
  can't work.
- `_subint._has_subints` gate now imports `concurrent.interpreters` (not
  `_interpreters`) as the version sentinel.

Also, reshuffle `pyproject.toml` deps into
per-python-version `[tool.uv.dependency-groups]`:
- `subints` group: `msgspec>=0.21.0`, py>=3.14
- `eventfd` group: `cffi>=1.17.1`, py>=3.13,<3.14
- `sync_pause` group: `greenback`, py>=3.13,<3.14
  (was in `devx`; moved out bc no 3.14 yet)

Bump top-level `msgspec>=0.20.0` too.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Lock in the escape-hatch machinery added to `tractor.spawn._subint`
during the Phase B.2/B.3 bringup (issue #379) so future stdlib
regressions or our own refactors don't silently re-introduce the
mid-suite hangs.

Deats,
- `test_subint_happy_teardown`: baseline — spawn a subactor, one portal
  RPC, clean teardown. If this breaks, something's wrong unrelated to
  the hard-kill shields.
- `test_subint_non_checkpointing_child`: cancel a subactor stuck in
  a non-checkpointing Python loop (`threading.Event.wait()` releases the
  GIL but never inserts a trio checkpoint). Validates the bounded-shield
  + daemon-driver-thread combo abandons the thread after
    `_HARD_KILL_TIMEOUT`.

Every test is wrapped in `trio.fail_after()` for a deterministic
per-test wall-clock ceiling (an unbounded audit would defeat itself) and
arms `tractor.devx.dump_on_hang()` so a hang captures a stack dump
— pytest's stderr capture swallows `faulthandler` output by default.

Gated via `pytest.importorskip('concurrent.interpreters')` and
a module-level skip when `--spawn-backend` isn't `'subint'`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Classify and write up the two distinct hang modes hit during Phase
B subint bringup (issue #379) so future triage doesn't re-derive them
from scratch.

Deats, two new `ai/conc-anal/` docs,
- `subint_sigint_starvation_issue.md`: abandoned legacy-subint thread
  + shared GIL → main trio loop starves → signal-wakeup-fd pipe fills
  → `SIGINT` silently dropped (`strace` shows `write() = EAGAIN` on the
  wakeup-fd). Un- Ctrl-C-able. Structurally a CPython limit; blocked on
  `msgspec` PEP 684 (jcrist/msgspec#563)

- `subint_cancel_delivery_hang_issue.md`: parent-side trio task parks on
  an orphaned IPC channel after subint teardown — no clean EOF delivered
  to the waiting receive. Ctrl-C-able (main loop iterates fine); OUR bug
  to fix. Candidate fix: explicit parent-side channel abort in
  `subint_proc`'s hard-kill teardown

Cross-link the docs from their test reproducers,
- `test_stale_entry_is_deleted` (→ starvation class): wrap
  `trio.run(main)` in `dump_on_hang(seconds=20)` so a future regression
  captures a stack dump. Kept un- skipped so the dump file is
  inspectable

- `test_subint_non_checkpointing_child` (→ delivery class): extend
  docstring with a "KNOWN ISSUE" block pointing at the analysis

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Log the `claude-opus-4-7` collab that produced `e92e3cd2` ("Doc `subint`
backend hang classes + arm `dump_on_hang`"). Substantive bc the two new
`ai/conc-anal/` docs were jointly authored — user framed the two-class
split + set candidate-fix ordering for the class-2 (Ctrl-C-able) hang;
claude drafted the prose and the test-side cross-linking comments.

`.raw.md` is in diff-ref mode — per-file pointers via `git diff
e92e3cd~1..e92e3cd -- <path>` rather than re-embedding content that
already lives in `git log -p`.

Prompt-IO: ai/prompt-io/claude/20260420T192739Z_5e8cd8b2_prompt_io.md

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add a hard process-level wall-clock bound on the two
known-hanging subint-backend tests so an unattended
suite run can't wedge indefinitely in either of the
hang classes doc'd in `ai/conc-anal/`.

Deats,
- New `testing` dep: `pytest-timeout>=2.3`.
- `test_stale_entry_is_deleted`:
  `@pytest.mark.timeout(3, method='thread')`. The
  `method='thread'` choice is deliberate —
  `method='signal'` routes via `SIGALRM` which is
  starved by the same GIL-hostage path that drops
  `SIGINT` (see `subint_sigint_starvation_issue.md`),
  so it'd never actually fire in the starvation case.
- `test_subint_non_checkpointing_child`: same
  decorator, same reasoning (defense-in-depth over
  the inner `trio.fail_after(15)`).

At timeout, `pytest-timeout` hard-kills the pytest
process itself — that's the intended behavior here;
the alternative is the suite never returning.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add two more tests to the catalog in
`conc-anal/subint_sigint_starvation_issue.md` — same
signal-wakeup-fd-saturation fingerprint (abandoned legacy-subint driver
threads → shared-GIL starvation → `write() = EAGAIN` on the wakeup pipe
→ silent SIGINT drop), different load patterns.

Deats,
- `test_cancel_while_childs_child_in_sync_sleep[subint-False]`: nested
  actor-tree + sync-sleeping grandchild. Under `trio`/`mp_*` the "zombie
  reaper" is a subproc `SIGKILL`; no equivalent exists under subint, so
  the grandchild persists in its abandoned driver thread. Often only
  manifests under full-suite runs (earlier tests seed the
  abandoned-thread pool).

- `test_multierror_fast_nursery[subint-25-0.5]`: 25 concurrent subactors
  all go through teardown on the multierror. Bounded hard-kills run in
  parallel — so the total budget is ~3s, not 3s × 25. Leaves 25
  abandoned driver threads simultaneously alive, an extreme pressure
  multiplier. `strace` shows several successful `write(16, "\2", 1) = 1`
  (GIL round-robin IS giving main brief slices) before finally
  saturating with `EAGAIN`.

Also include a `pstree -snapt <pid>` capture showing
16+ live `{subint-driver[<interp_id>}` threads at the
moment of hang — the direct GIL-contender population.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
A reusable `@pytest.mark.skipon_spawn_backend( '<backend>' [, ...],
reason='...')` marker for backend-specific known-hang / -borked cases
— avoids scattering `@pytest.mark.skipif(lambda ...)` branches across
tests that misbehave under a particular `--spawn-backend`.

Deats,
- `pytest_configure()` registers the marker via
  `addinivalue_line('markers', ...)`.
- New `pytest_collection_modifyitems()` hook walks
  each collected item with `item.iter_markers(
  name='skipon_spawn_backend')`, checks whether the
  active `--spawn-backend` appears in `mark.args`, and
  if so injects a concrete `pytest.mark.skip(
  reason=...)`. `iter_markers()` makes the decorator
  work at function, class, or module (`pytestmark =
  [...]`) scope transparently.
- First matching mark wins; default reason is
  `f'Borked on --spawn-backend={backend!r}'` if the
  caller doesn't supply one.

Also, tighten type annotations on nearby `pytest`
integration points — `pytest_configure`, `debug_mode`,
`spawn_backend`, `tpt_protos`, `tpt_proto` — now taking
typed `pytest.Config` / `pytest.FixtureRequest` params.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Adopt the `@pytest.mark.skipon_spawn_backend('subint',
reason=...)` marker (a617b52) across the suites
reproducing the `subint` GIL-contention / starvation
hang classes doc'd in `ai/conc-anal/subint_*_issue.md`.

Deats,
- Module-level `pytestmark` on full-file-hanging suites:
  - `tests/test_cancellation.py`
  - `tests/test_inter_peer_cancellation.py`
  - `tests/test_pubsub.py`
  - `tests/test_shm.py`
- Per-test decorator where only one test in the file
  hangs:
  - `tests/discovery/test_registrar.py
    ::test_stale_entry_is_deleted` — replaces the
    inline `if start_method == 'subint': pytest.skip`
    branch with a declarative skip.
  - `tests/test_subint_cancellation.py
    ::test_subint_non_checkpointing_child`.
- A few per-test decorators are left commented-in-
  place as breadcrumbs for later finer-grained unskips.

Also, some nearby tidying in the affected files:
- Annotate loose fixture / test params
  (`pytest.FixtureRequest`, `str`, `tuple`, `bool`) in
  `tests/conftest.py`, `tests/devx/conftest.py`, and
  `tests/test_cancellation.py`.
- Normalize `"""..."""` → `'''...'''` docstrings per
  repo convention on a few touched tests.
- Add `timeout=6` / `timeout=10` to
  `@tractor_test(...)` on `test_cancel_infinite_streamer`
  and `test_some_cancels_all`.
- Drop redundant `spawn_backend` param from
  `test_cancel_via_SIGINT`; use `start_method` in the
  `'mp' in ...` check instead.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
goodboy added 17 commits April 23, 2026 18:48
The `subint_forkserver` backend's child runtime is trio-native (uses
`_trio_main` + receives `SpawnSpec` over IPC just like `trio`/`subint`),
so `tractor.devx.debug._tty_lock` works in those subactors. Wire the
runtime gates that historically hard-coded `_spawn_method == 'trio'` to
recognize this third backend.

Deats,
- new `_DEBUG_COMPATIBLE_BACKENDS` module-const in `tractor._root`
  listing the spawn backends whose subactor runtime is trio-native
  (`'trio'`, `'subint_forkserver'`). Both the enable-site
  (`_runtime_vars['_debug_mode'] = True`) and the cleanup-site reset
  key.
  off the same tuple — keep them in lockstep when adding backends
- `open_root_actor`'s `RuntimeError` for unsupported backends now
  reports the full compatible-set + the rejected method instead of the
  stale "only `trio`" msg.
- `runtime._runtime.Actor._from_parent`'s SpawnSpec-recv gate adds
  `'subint_forkserver'` to the existing `('trio', 'subint')` tuple
  — fork child-side runtime receives the same SpawnSpec IPC handshake as
  the others.
- `subint_forkserver_proc` child-target now passes
  `spawn_method='subint_forkserver'` (was hard-coded `'trio'`) so
  `Actor.pformat()` / log lines reflect the actual parent-side spawn
  mechanism rather than masquerading as plain `trio`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Follow-up to 72d1b90 (was prev commit adding `debug_mode` for
`subint_forkserver`): that commit wired the runtime-side
`subint_forkserver` SpawnSpec-recv gate in `Actor._from_parent`, but the
`subint_forkserver_proc` child-target was still passing
`spawn_method='trio'` to `_trio_main` — so `Actor.pformat()` / log lines
would report the subactor as plain `'trio'` instead of the actual
parent-side spawn mechanism. Flip the label to `'subint_forkserver'`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Also, some slight touchups in `.spawn._subint`.
Fork-based backends (esp. `subint_forkserver`) can
leak child actor processes on cancelled / SIGINT'd
test runs; the zombies keep the tractor default
registry (`127.0.0.1:1616` / `/tmp/registry@1616.sock`)
bound, so every subsequent session can't bind and
50+ unrelated tests fail with the same
`TooSlowError` / "address in use" signature. Document
the pre-flight + post-cancel check as a mandatory
step 4.

Deats,
- **primary signal**: `ss -tlnp | grep ':1616'` for a
  bound TCP registry listener — the authoritative
  check since :1616 is unique to our runtime
- `pgrep -af` scoped to `$(pwd)/py[0-9]*/bin/python.*
  _actor_child_main|subint-forkserv` for leftover
  actor/forkserver procs — scoped deliberately so we
  don't false-flag legit long-running tractor-
  embedding apps like `piker`
- `ls /tmp/registry@*.sock` for stale UDS sockets
- scoped cleanup recipe (SIGTERM + SIGKILL sweep
  using the same `$(pwd)/py*` pattern, UDS `rm -f`,
  re-verify) plus a fallback for when a zombie holds
  :1616 but doesn't match the pattern: `ss -tlnp` →
  kill by PID
- explicit false-positive warning calling out the
  `piker` case (`~/repos/piker/py*/bin/python3 -m
  tractor._child ...`) so a bare `pgrep` doesn't lead
  to nuking unrelated apps

Goal: short-circuit the "spelunking into test code"
rabbit-hole when the real cause is just a leaked PID
from a prior session, without collateral damage to
other tractor-embedding projects on the same box.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
New `ai/conc-anal/
subint_forkserver_test_cancellation_leak_issue.md`
captures a descendant-leak surfaced while wiring
`subint_forkserver` into the full test matrix:
running `tests/test_cancellation.py` under
`--spawn-backend=subint_forkserver` reproducibly
leaks **exactly 5** `subint-forkserv` comm-named
child processes that survive session exit, each
holding a `LISTEN` on `:1616` (the tractor default
registry addr) — and therefore poisons every
subsequent test session that defaults to that addr.

Deats,
- TL;DR + ruled-out checks confirming the procs are
  ours (not piker / other tractor-embedding apps) —
  `/proc/$pid/cmdline` + cwd both resolve to this
  repo's `py314/` venv
- root cause: `_ForkedProc.kill()` is PID-scoped
  (plain `os.kill(SIGKILL)` to the direct child),
  not tree-scoped — grandchildren spawned during a
  multi-level cancel test get reparented to init and
  inherit the registry listen socket
- proposed fix directions ranked: (1) put each
  forkserver-spawned subactor in its own process-
  group (`os.setpgrp()` in fork-child) + tree-kill
  via `os.killpg(pgid, SIGKILL)` on teardown,
  (2) `PR_SET_CHILD_SUBREAPER` on root, (3) explicit
  `/proc/<pid>/task/*/children` walk. Vote: (1) —
  POSIX-standard, aligns w/ `start_new_session=True`
  semantics in `subprocess.Popen` / trio's
  `open_process`
- inline reproducer + cleanup recipe scoped to
  `$(pwd)/py314/bin/python.*pytest.*spawn-backend=
  subint_forkserver` so cleanup doesn't false-flag
  unrelated tractor procs (consistent w/
  `run-tests` skill's zombie-check guidance)

Stopgap hygiene fix (wiring `reg_addr` through the 5
leaky tests in `test_cancellation.py`) is incoming as
a follow-up — that one stops the blast radius, but
zombies still accumulate per-run until the real
tree-kill fix lands.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Stopgap companion to d012196 (`subint_forkserver`
test-cancellation leak doc): five tests in
`tests/test_cancellation.py` were running against the
default `:1616` registry, so any leaked
`subint-forkserv` descendant from a prior test holds
the port and blows up every subsequent run with
`TooSlowError` / "address in use". Thread the
session-unique `reg_addr` fixture through so each run
picks its own port — zombies can no longer poison
other tests (they'll only cross-contaminate whatever
happens to share their port, which is now nothing).

Deats,
- add `reg_addr: tuple` fixture param to:
  - `test_cancel_infinite_streamer`
  - `test_some_cancels_all`
  - `test_nested_multierrors`
  - `test_cancel_via_SIGINT`
  - `test_cancel_via_SIGINT_other_task`
- explicitly pass `registry_addrs=[reg_addr]` to the
  two `open_nursery()` calls that previously had no
  kwargs at all (in `test_cancel_via_SIGINT` and
  `test_cancel_via_SIGINT_other_task`)
- add bounded `@pytest.mark.timeout(7, method='thread')`
  to `test_nested_multierrors` so a hung run doesn't
  wedge the whole session

Still doesn't close the real leak — the
`subint_forkserver` backend's `_ForkedProc.kill()` is
PID-scoped not tree-scoped, so grandchildren survive
teardown regardless of registry port. This commit is
just blast-radius containment until that fix lands.
See `ai/conc-anal/
subint_forkserver_test_cancellation_leak_issue.md`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
The previous cleanup recipe went straight to
SIGTERM+SIGKILL, which hides bugs: tractor is
structured concurrent — `_trio_main` catches SIGINT
as an OS-cancel and cascades `Portal.cancel_actor`
over IPC to every descendant. So a graceful SIGINT
exercises the actual SC teardown path; if it hangs,
that's a real bug to file (the forkserver `:1616`
zombie was originally suspected to be one of these
but turned out to be a teardown gap in
`_ForkedProc.kill()` instead).

Deats,
- step 1: `pkill -INT` scoped to `$(pwd)/py*` — no
  sleep yet, just send the signal
- step 2: bounded wait loop (10 × 0.3s = ~3s) using
  `pgrep` to poll for exit. Loop breaks early on
  clean exit
- step 3: `pkill -9` only if graceful timed out, w/
  a logged escalation msg so it's obvious when SC
  teardown didn't complete
- step 4: same SIGINT-first ladder for the rare
  `:1616`-holding zombie that doesn't match the
  cmdline pattern (find PID via `ss -tlnp`, then
  `kill -INT NNNN; sleep 1; kill -9 NNNN`)
- steps 5-6: UDS-socket `rm -f` + re-verify
  unchanged

Goal: surface real teardown bugs through the test-
cleanup workflow instead of papering over them with
`-9`.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Major rewrite of
`subint_forkserver_test_cancellation_leak_issue.md`
after empirical investigation revealed the earlier
"descendant-leak + missing tree-kill" diagnosis
conflated two unrelated symptoms:

1. **5-zombie leak holding `:1616`** — turned out to
   be a self-inflicted cleanup bug: `pkill`-ing a bg
   pytest task (SIGTERM/SIGKILL, no SIGINT) skipped
   the SC graceful cancel cascade entirely. Codified
   the real fix — SIGINT-first ladder w/ bounded
   wait before SIGKILL — in e5e2afb (`run-tests`
   SKILL) and
   `feedback_sc_graceful_cancel_first.md`.
2. **`test_nested_multierrors[subint_forkserver]`
   hangs indefinitely** — the actual backend bug,
   and it's a deadlock not a leak.

Deats,
- new diagnosis: all 5 procs are kernel-`S` in
  `do_epoll_wait`; pytest-main's trio-cache workers
  are in `os.waitpid` waiting for children that are
  themselves waiting on IPC that never arrives —
  graceful `Portal.cancel_actor` cascade never
  reaches its targets
- tree-structure evidence: asymmetric depth across
  two identical `run_in_actor` calls — child 1
  (3 threads) spawns both its grandchildren; child 2
  (1 thread) never completes its first nursery
  `run_in_actor`. Smells like a race on fork-
  inherited state landing differently per spawn
  ordering
- new hypothesis: `os.fork()` from a subactor
  inherits the ROOT parent's IPC listener FDs
  transitively. Grandchildren end up with three
  overlapping FD sets (own + direct-parent + root),
  so IPC routing becomes ambiguous. Predicts bug
  scales with fork depth — matches reality: single-
  level spawn works, multi-level hangs
- ruled out: `_ForkedProc.kill()` tree-kill (never
  reaches hard-kill path), `:1616` contention (fixed
  by `reg_addr` fixture wiring), GIL starvation
  (each subactor has its own OS process+GIL),
  child-side KBI absorption (`_trio_main` only
  catches KBI at `trio.run()` callsite, reached
  only on trio-loop exit)
- four fix directions ranked: (1) blanket post-fork
  `closerange()`, (2) `FD_CLOEXEC` + audit,
  (3) targeted FD cleanup via `actor.ipc_server`
  handle, (4) `os.posix_spawn` w/ `file_actions`.
  Vote: (3) — surgical, doesn't break the "no exec"
  design of `subint_forkserver`
- standalone repro added (`spawn_and_error(breadth=
  2, depth=1)` under `trio.fail_after(20)`)
- stopgap: skip `test_nested_multierrors` + multi-
  level-spawn tests under the backend via
  `@pytest.mark.skipon_spawn_backend(...)` until
  fix lands

Killing the "tree-kill descendants" fix-direction
section: it addressed a bug that didn't exist.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Implements fix-direction (1)/blunt-close-all-FDs from
b71705b (`subint_forkserver` nested-cancel hang
diag), targeting the multi-level cancel-cascade
deadlock in
`test_nested_multierrors[subint_forkserver]`.

The diagnosis doc voted for surgical FD cleanup via
`actor.ipc_server` handle as the cleanest approach,
but going blunt is actually the right call: after
`os.fork()`, the child immediately enters
`_actor_child_main()` which opens its OWN IPC
sockets / wakeup-fd / epoll-fd / etc. — none of the
parent's FDs are needed. Closing everything except
stdio is safe AND defends against future
listener/IPC additions to the parent inheriting
silently into children.

Deats,
- new `_close_inherited_fds(keep={0,1,2}) -> int`
  helper. Linux fast-path enumerates `/proc/self/fd`;
  POSIX fallback uses `RLIMIT_NOFILE` range. Matches
  the stdlib `subprocess._posixsubprocess.close_fds`
  strategy. Returns close-count for sanity logging
- wire into `fork_from_worker_thread._worker()`'s
  post-fork child prelude — runs immediately after
  the pid-pipe `os.close(rfd/wfd)`, before the user
  `child_target` callable executes
- docstring cross-refs the diagnosis doc + spells
  out the FD-inheritance-cascade mechanism and why
  the close-all approach is safe for our spawn shape

Validation pending: re-run `test_nested_multierrors[subint_forkserver]`
to confirm the deadlock is gone.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Two coordinated improvements to the `subint_forkserver` backend:

1. Replace `trio.to_thread.run_sync(os.waitpid, ...,
   abandon_on_cancel=False)` in `_ForkedProc.wait()`
   with `trio.lowlevel.wait_readable(pidfd)`. The
   prior version blocked a trio cache thread on a
   sync syscall — outer cancel scopes couldn't
   unwedge it when something downstream got stuck.
   Same pattern `trio.Process.wait()` and
   `proc_waiter` (the mp backend) already use.

2. Drop the `@pytest.mark.xfail(strict=True)` from
   `test_orphaned_subactor_sigint_cleanup_DRAFT` —
   the test now PASSES after 0cd0b63 (fork-child
   FD scrub). Same root cause as the nested-cancel
   hang: inherited IPC/trio FDs were poisoning the
   child's event loop. Closing them lets SIGINT
   propagation work as designed.

Deats,
- `_ForkedProc.__init__` opens a pidfd via
  `os.pidfd_open(pid)` (Linux 5.3+, Python 3.9+)
- `wait()` parks on `trio.lowlevel.wait_readable()`,
  then non-blocking `waitpid(WNOHANG)` to collect
  the exit status (correct since the pidfd signal
  IS the child-exit notification)
- `ChildProcessError` swallow handles the rare race
  where someone else reaps first
- pidfd closed after `wait()` completes (one-shot
  semantics) + `__del__` belt-and-braces for
  unexpected-teardown paths
- test docstring's `@xfail` block replaced with a
  `# NOTE` comment explaining the historical
  context + cross-ref to the conc-anal doc; test
  remains in place as a regression guard

The two changes are interdependent — the
cancellable `wait()` matters for the same nested-
cancel scenarios the FD scrub fixes, since the
original deadlock had trio cache workers wedged in
`os.waitpid` swallowing the outer cancel.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Completes the nested-cancel deadlock fix started in
0cd0b63 (fork-child FD scrub) and fe540d0 (pidfd-
cancellable wait). The remaining piece: the parent-
channel `process_messages` loop runs under
`shield=True` (so normal cancel cascades don't kill
it prematurely), and relies on EOF arriving when the
parent closes the socket to exit naturally.

Under exec-spawn backends (`trio_proc`, mp) that EOF
arrival is reliable — parent's teardown closes the
handler-task socket deterministically. But fork-
based backends like `subint_forkserver` share enough
process-image state that EOF delivery becomes racy:
the loop parks waiting for an EOF that only arrives
after the parent finishes its own teardown, but the
parent is itself blocked on `os.waitpid()` for THIS
actor's exit. Mutual wait → deadlock.

Deats,
- `async_main` stashes the cancel-scope returned by
  `root_tn.start(...)` for the parent-chan
  `process_messages` task onto the actor as
  `_parent_chan_cs`
- `Actor.cancel()`'s teardown path (after
  `ipc_server.cancel()` + `wait_for_shutdown()`)
  calls `self._parent_chan_cs.cancel()` to
  explicitly break the shield — no more waiting for
  EOF delivery, unwinding proceeds deterministically
  regardless of backend
- inline comments on both sites explain the mutual-
  wait deadlock + why the explicit cancel is
  backend-agnostic rather than a forkserver-specific
  workaround

With this + the prior two fixes, the
`subint_forkserver` nested-cancel cascade unwinds
cleanly end-to-end.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Two-part stopgap for the still-hanging
`test_nested_multierrors[subint_forkserver]`:

1. Skip-mark the test via
   `@pytest.mark.skipon_spawn_backend('subint_forkserver',
   reason=...)` so it stops blocking the test
   matrix while the remaining bug is being chased.
   The reason string cross-refs the conc-anal doc
   for full context.

2. Update the conc-anal doc
   (`subint_forkserver_test_cancellation_leak_issue.md`) with the
   empirical state after the three nested- cancel fix commits
   (`0cd0b633` FD scrub + `fe540d02` pidfd wait + `57935804` parent-chan
   shield break) landed, narrowing the remaining hang from "everything
   broken" to "peer-channel loops don't exit on `service_tn` cancel".

Deats from the DIAGDEBUG instrumentation pass,
- 80 `process_messages` ENTERs, 75 EXITs → 5 stuck
- ALL 40 `shield=True` ENTERs matched EXIT — the
  `_parent_chan_cs.cancel()` wiring from `57935804`
  works as intended for shielded loops.
- the 5 stuck loops are all `shield=False` peer-
  channel handlers in `handle_stream_from_peer`
  (inbound connections handled by
  `stream_handler_tn`, which IS `service_tn` in the
  current config).
- after `_parent_chan_cs.cancel()` fires, NEW
  shielded loops appear on the session reg_addr
  port — probably discovery-layer reconnection;
  doesn't block teardown but indicates the cascade
  has more moving parts than expected.

The remaining unknown: why don't the 5 peer-channel loops exit when
`service_tn.cancel_scope.cancel()` fires? They're not shielded, they're
inside the service_tn scope, a standard cancel should propagate through.
Some fork-config-specific divergence keeps them alive. Doc lists three
follow-up experiments (stackscope dump, side-by-side `trio_proc`
comparison, audit of the `tractor/ipc/_server.py:448` `except
trio.Cancelled:` path).

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Two new sections in
`subint_forkserver_test_cancellation_leak_issue.md`
documenting continued investigation of the
`test_nested_multierrors[subint_forkserver]` peer-
channel-loop hang:

1. **"Attempted fix (DID NOT work) — hypothesis
   (3)"**: tried sync-closing peer channels' raw
   socket fds from `_serve_ipc_eps`'s finally block
   (iterate `server._peers`, `_chan._transport.
   stream.socket.close()`). Theory was that sync
   close would propagate as `EBADF` /
   `ClosedResourceError` into the stuck
   `recv_some()` and unblock it. Result: identical
   hang. Either trio holds an internal fd
   reference that survives external close, or the
   stuck recv isn't even the root blocker. Either
   way: ruled out, experiment reverted, skip-mark
   restored.
2. **"Aside: `-s` flag changes behavior for peer-
   intensive tests"**: noticed
   `test_context_stream_semantics.py` under
   `subint_forkserver` hangs with default
   `--capture=fd` but passes with `-s`
   (`--capture=no`). Working hypothesis: subactors
   inherit pytest's capture pipe (fds 1,2 — which
   `_close_inherited_fds` deliberately preserves);
   verbose subactor logging fills the buffer,
   writes block, deadlock. Fix direction (if
   confirmed): redirect subactor stdout/stderr to
   `/dev/null` or a file in `_actor_child_main`.
   Not a blocker on the main investigation;
   deserves its own mini-tracker.

Both sections are diagnosis-only — no code changes
in this commit.

(this patch was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Three places that previously swallowed exceptions silently now log via
`log.exception()` so they surface in the runtime log when something
weird happens — easier to track down sneaky failures in the
fork-from-worker-thread / subint-bootstrap primitives.

Deats,
- `_close_inherited_fds()`: post-fork child's per-fd `os.close()`
  swallow now logs the fd that failed to close. The comment notes the
  expected failure modes (already-closed-via-listdir-race,
  otherwise-unclosable) — both still fine to ignore semantically, but
  worth flagging in the log.
- `fork_from_worker_thread()` parent-side timeout branch: the
  `os.close(rfd)` + `os.close(wfd)` cleanup now logs each pipe-fd close
  failure separately before raising the `worker thread didn't return`
  RuntimeError.
- `run_subint_in_worker_thread._drive()`: when
  `_interpreters.exec(interp_id, bootstrap)` raises a `BaseException`,
  log the full call signature (interp_id + bootstrap) along with the
  captured exception, before stashing into `err` for the outer caller.

Behavior unchanged — only adds observability.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
# Revisit `subint_forkserver` thread-cache constraints once msgspec PEP 684 support lands

Follow-up tracker for cleanup work gated on the msgspec
PEP 684 adoption upstream ([jcrist/msgspec#563](https://github.com/jcrist/msgspec/issues/563)).
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update this to the new subints compat issue we make @ msgspec.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Third diagnostic pass on
`test_nested_multierrors[subint_forkserver]` hang.
Two prior hypotheses ruled out + a new, more
specific deadlock shape identified.

Ruled out,
- **capture-pipe fill** (`-s` flag changes test):
  retested explicitly — `test_nested_multierrors`
  hangs identically with and without `-s`. The
  earlier observation was likely a competing
  pytest process I had running in another session
  holding registry state
- **stuck peer-chan recv that cancel can't
  break**: pivot from the prior pass. With
  `handle_stream_from_peer` instrumented at ENTER
  / `except trio.Cancelled:` / finally: 40
  ENTERs, ZERO `trio.Cancelled` hits. Cancel never
  reaches those tasks at all — the recvs are
  fine, nothing is telling them to stop

Actual deadlock shape: multi-level mutual wait.

    root              blocks on spawner.wait()
      spawner         blocks on grandchild.wait()
        grandchild    blocks on errorer.wait()
          errorer     Actor.cancel() ran, but proc
                      never exits

`Actor.cancel()` fired in 12 PIDs — but NOT in
root + 2 direct spawners. Those 3 have peer
handlers stuck because their own `Actor.cancel()`
never runs, which only runs when the enclosing
`tractor.open_nursery()` exits, which waits on
`_ForkedProc.wait()` for the child pidfd to
signal, which only signals when the child
process fully exits.

Refined question: **why does an errorer process
not exit after its `Actor.cancel()` completes?**
Three hypotheses (unverified):
1. `_parent_chan_cs.cancel()` fires but the
   shielded loop's recv is stuck in a way cancel
   still can't break
2. `async_main`'s post-cancel unwind has other
   tasks in `root_tn` awaiting something that
   never arrives (e.g. outbound IPC reply)
3. `os._exit(rc)` in `_worker` never runs because
   `_child_target` never returns

Next-session probes (priority order):
1. instrument `_worker`'s fork-child branch —
   confirm whether `child_target()` returns /
   `os._exit(rc)` is reached for errorer PIDs
2. instrument `async_main`'s final unwind — see
   which await in teardown doesn't complete
3. compare under `trio_proc` backend at the
   equivalent level to spot divergence

No code changes — diagnosis-only.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
@goodboy goodboy force-pushed the subint_forkserver_backend branch from 7753838 to ab86f76 Compare April 24, 2026 01:26
Fourth diagnostic pass — instrument `_worker`'s
fork-child branch (`pre child_target()` / `child_
target RETURNED rc=N` / `about to os._exit(rc)`)
and `_trio_main` boundaries (`about to trio.run` /
`trio.run RETURNED NORMALLY` / `FINALLY`). Test
config: depth=1/breadth=2 = 1 root + 14 forked =
15 actors total.

Fresh-run results,
- **9 processes complete the full flow**:
  `trio.run RETURNED NORMALLY` → `child_target
  RETURNED rc=0` → `os._exit(0)`. These are tree
  LEAVES (errorers) plus their direct parents
  (depth-0 spawners) — they actually exit
- **5 processes stuck INSIDE `trio.run(trio_
  main)`**: hit "about to trio.run" but never
  see "trio.run RETURNED NORMALLY". These are
  root + top-level spawners + one intermediate

The deadlock is in `async_main` itself, NOT the
peer-channel loops. Specifically, the outer
`async with root_tn:` in `async_main` never exits
for the 5 stuck actors, so the cascade wedges:

    trio.run never returns
      → _trio_main finally never runs
        → _worker never reaches os._exit(rc)
          → process never dies
            → parent's _ForkedProc.wait() blocks
              → parent's nursery hangs
                → parent's async_main hangs
                  → (recurse up)

The precise new question: **what task in the 5
stuck actors' `async_main` never completes?**
Candidates:
1. shielded parent-chan `process_messages` task
   in `root_tn` — but we cancel it via
   `_parent_chan_cs.cancel()` in `Actor.cancel()`,
   which only runs during
   `open_root_actor.__aexit__`, which itself runs
   only after `async_main`'s outer unwind — which
   doesn't happen. So the shield isn't broken in
   this path.
2. `actor_nursery._join_procs.wait()` or similar
   inline in the backend `*_proc` flow.
3. `_ForkedProc.wait()` on a grandchild that DID
   exit — but pidfd_open watch didn't fire (race
   between `pidfd_open` and the child exiting?).

Most specific next probe: add DIAG around
`_ForkedProc.wait()` enter/exit to see whether
pidfd-based wait returns for every grandchild
exit. If a stuck parent's `_ForkedProc.wait()`
never returns despite its child exiting → pidfd
mechanism has a race bug under nested forkserver.

Asymmetry observed in the cascade tree: some d=0
spawners exit cleanly, others stick, even though
they started identically. Not purely depth-
determined — some race condition in nursery
teardown when multiple siblings error
simultaneously.

No code changes — diagnosis-only.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Follow-up tracker for cleanup work gated on the msgspec
PEP 684 adoption upstream ([jcrist/msgspec#563](https://github.com/jcrist/msgspec/issues/563)).

Context — why this exists
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can lose the threading.Thread once we can use isolated (GIL) subints..

goodboy added 7 commits April 23, 2026 22:34
Fifth diagnostic pass pinpointed the hang to
`async_main`'s finally block — every stuck actor
reaches `FINALLY ENTER` but never `RETURNING`.
Specifically `await ipc_server.wait_for_no_more_
peers()` never returns when a peer-channel handler
is stuck: the `_no_more_peers` Event is set only
when `server._peers` empties, and stuck handlers
keep their channels registered.

Wrap the call in `trio.move_on_after(3.0)` + a
warning-log on timeout that records the still-
connected peer count. 3s is enough for any
graceful cancel-ack round-trip; beyond that we're
in bug territory and need to proceed with local
teardown so the parent's `_ForkedProc.wait()` can
unblock. Defensive-in-depth regardless of the
underlying bug — a local finally shouldn't block
on remote cooperation forever.

Verified: with this fix, ALL 15 actors reach
`async_main: RETURNING` (up from 10/15 before).

Test still hangs past 45s though — there's at
least one MORE unbounded wait downstream of
`async_main`. Candidates enumerated in the doc
update (`open_root_actor` finally /
`actor.cancel()` internals / trio.run bg tasks /
`_serve_ipc_eps` finally). Skip-mark stays on
`test_nested_multierrors[subint_forkserver]`.

Also updates
`subint_forkserver_test_cancellation_leak_issue.md`
with the new pinpoint + summary of the 6-item
investigation win list:
1. FD hygiene fix (`_close_inherited_fds`) —
   orphan-SIGINT closed
2. pidfd-based `_ForkedProc.wait` — cancellable
3. `_parent_chan_cs` wiring — shielded parent-chan
   loop now breakable
4. `wait_for_no_more_peers` bound — THIS commit
5. Ruled-out hypotheses: tree-kill missing, stuck
   socket recv, capture-pipe fill (all wrong)
6. Remaining unknown: at least one more unbounded
   wait in the teardown cascade above `async_main`

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Sixth and final diagnostic pass — after all 4
cascade fixes landed (FD hygiene, pidfd wait,
`_parent_chan_cs` wiring, bounded peer-clear), the
actual last gate on
`test_nested_multierrors[subint_forkserver]`
turned out to be **pytest's default
`--capture=fd` stdout/stderr capture**, not
anything in the runtime cascade.

Empirical result: `pytest -s` → test PASSES in
6.20s. Default `--capture=fd` → hangs forever.

Mechanism: pytest replaces the parent's fds 1,2
with pipe write-ends it reads from. Fork children
inherit those pipes (since `_close_inherited_fds`
correctly preserves stdio). The error-propagation
cascade in a multi-level cancel test generates
7+ actors each logging multiple `RemoteActorError`
/ `ExceptionGroup` tracebacks — enough output to
fill Linux's 64KB pipe buffer. Writes block,
subactors can't progress, processes don't exit,
`_ForkedProc.wait` hangs.

Self-critical aside: I earlier tested w/ and w/o
`-s` and both hung, concluding "capture-pipe
ruled out". That was wrong — at that time fixes
1-4 weren't all in place, so the test was
failing at deeper levels long before reaching
the "produce lots of output" phase. Once the
cascade could actually tear down cleanly, enough
output flowed to hit the pipe limit. Order-of-
operations mistake: ruling something out based
on a test that was failing for a different
reason.

Deats,
- `subint_forkserver_test_cancellation_leak_issue
  .md`: new section "Update — VERY late: pytest
  capture pipe IS the final gate" w/ DIAG timeline
  showing `trio.run` fully returns, diagnosis of
  pipe-fill mechanism, retrospective on the
  earlier wrong ruling-out, and fix direction
  (redirect subactor stdout/stderr to `/dev/null`
  in fork-child prelude, conditional on
  pytest-detection or opt-in flag)
- `tests/test_cancellation.py`: skip-mark reason
  rewritten to describe the capture-pipe gate
  specifically; cross-refs the new doc section
- `tests/spawn/test_subint_forkserver.py`: the
  orphan-SIGINT test regresses back to xfail.
  Previously passed after the FD-hygiene fix,
  but the new `wait_for_no_more_peers(
  move_on_after=3.0)` bound in `async_main`'s
  teardown added up to 3s latency, pushing
  orphan-subactor exit past the test's 10s poll
  window. Real fix: faster orphan-side teardown
  OR extend poll window to 15s

No runtime code changes in this commit — just
test-mark adjustments + doc wrap-up.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Encode the hard-won lesson from the forkserver
cancel-cascade investigation into two skill docs
so future sessions grep-find it before spelunking
into trio internals.

Deats,
- `.claude/skills/conc-anal/SKILL.md`:
  - new "Unbounded waits in cleanup paths"
    section — rule: bound every `await X.wait()`
    in cleanup paths with `trio.move_on_after()`
    unless the setter is unconditionally
    reachable. Recent example:
    `ipc_server.wait_for_no_more_peers()` in
    `async_main`'s finally (was unbounded,
    deadlocked when any peer handler stuck)
  - new "The capture-pipe-fill hang pattern"
    section — mechanism, grep-pointers to the
    existing `conftest.py` guards (`tests/conftest
    .py:258`, `:316`), cross-ref to the full
    post-mortem doc, and the grep-note: "if a
    multi-subproc tractor test hangs, `pytest -s`
    first, conc-anal second"
- `.claude/skills/run-tests/SKILL.md`: new
  "Section 9: The pytest-capture hang pattern
  (CHECK THIS FIRST)" with symptom / cause /
  pre-existing guards to grep / three-step debug
  recipe (try `-s`, lower loglevel, redirect
  stdout/stderr) / signature of this bug vs. a
  real code hang / historical reference

Cost several investigation sessions before the
capture-pipe issue surfaced — it was masked by
deeper cascade deadlocks. Once the cascades were
fixed, the tree tore down enough to generate
pipe-filling log volume. Lesson: **grep this
pattern first when any multi-subproc tractor test
hangs under default pytest but passes with `-s`.**

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Lands the capture-pipe workaround from the prior cluster of diagnosis
commits: switch pytest's `--capture` mode from the default `fd`
(redirects fd 1,2 to temp files, which fork children inherit and can
deadlock writing into) to `sys` (only `sys.stdout` / `sys.stderr` — fd
1,2 left alone).

Trade-off documented inline in `pyproject.toml`:
- LOST: per-test attribution of raw-fd output (C-ext writes,
  `os.write(2, ...)`, subproc stdout). Still goes to terminal / CI
  capture, just not per-test-scoped in the failure report.
- KEPT: `print()` + `logging` capture per-test (tractor's logger uses
  `sys.stderr`).
- KEPT: `pytest -s` debugging behavior.

This allows us to re-enable `test_nested_multierrors` without
skip-marking + clears the class of pytest-capture-induced hangs for any
future fork-based backend tests.

Deats,
- `pyproject.toml`: `'--capture=sys'` added to `addopts` w/ ~20 lines of
  rationale comment cross-ref'ing the post-mortem doc

- `test_cancellation`: drop `skipon_spawn_backend('subint_forkserver')`
  from `test_nested_ multierrors` — no longer needed.
  * file-level `pytestmark` covers any residual.

- `tests/spawn/test_subint_forkserver.py`: orphan-SIGINT test's xfail
  mark loosened from `strict=True` to `strict=False` + reason rewritten.
  * it passes in isolation but is session-env-pollution sensitive
    (leftover subactor PIDs competing for ports / inheriting harness
    FDs).
  * tolerate both outcomes until suite isolation improves.

- `test_shm`: extend the existing
  `skipon_spawn_backend('subint', ...)` to also skip
  `'subint_forkserver'`.
  * Different root cause from the cancel-cascade class:
    `multiprocessing.SharedMemory`'s `resource_tracker` + internals
    assume fresh- process state, don't survive fork-without-exec cleanly

- `tests/discovery/test_registrar.py`: bump timeout 3→7s on one test
  (unrelated to forkserver; just a flaky-under-load bump).

- `tractor.spawn._subint_forkserver`: inline comment-only future-work
  marker right before `_actor_child_main()` describing the planned
  conditional stdout/stderr-to-`/dev/null` redirect for cases where
  `--capture=sys` isn't enough (no code change — the redirect logic
  itself is deferred).

EXTRA NOTEs
-----------
The `--capture=sys` approach is the minimum- invasive fix: just a pytest
ini change, no runtime code change, works for all fork-based backends,
trade-offs well-understood (terminal-level capture still happens, just
not pytest's per-test attribution of raw-fd output).

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Which is for sure true on py3.14+ rn since `greenlet` didn't want to
build for us (yet).
Continues the hygiene pattern from de60167 (cancel tests) into
`tests/test_infected_asyncio.py`: many tests here were calling
`tractor.open_nursery()` w/o `registry_addrs=[reg_addr]` and thus racing
on the default `:1616` registry across sessions. Thread the
session-unique `reg_addr` through so leaked or slow-to-teardown
subactors from a prior test can't cross-pollute.

Deats,
- add `registry_addrs=[reg_addr]` to `open_nursery()`
  calls in suite where missing.
- `test_sigint_closes_lifetime_stack`:
  - add `reg_addr`, `debug_mode`, `start_method`
    fixture params
  - `delay` now reads the `debug_mode` param directly
    instead of calling `tractor.debug_mode()` (fires
    slightly earlier in the test lifecycle)
  - sanity assert `if debug_mode: assert
    tractor.debug_mode()` after nursery open
  - new print showing SIGINT target
    (`send_sigint_to` + resolved pid)
  - catch `trio.TooSlowError` around
    `ctx.wait_for_result()` and conditionally
    `pytest.xfail` when `send_sigint_to == 'child'
    and start_method == 'subint_forkserver'` — the
    known orphan-SIGINT limitation tracked in
    `ai/conc-anal/subint_forkserver_orphan_sigint_hang_issue.md`
- parametrize id typo fix: `'just_trio_slee'` → `'just_trio_sleep'`

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant