A subinterpreter-in-thread spawning backend by goodboy · Pull Request #446 · goodboy/tractor

goodboy · 2026-04-21T17:39:19Z

No description provided.

Group `.claude/` ignores per-skill instead of a flat list: `ai.skillz` symlinks, `/open-wkt`, `/code-review-changes`, `/pr-msg`, `/commit-msg`. Add missing symlink entries (`yt-url-lookup` -> `resolve-conflicts`, `inter-skill-review`). Drop stale `Claude worktrees` section (already covered by `.claude/wkts/`). (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code

New "Inspect last failures" section reads the pytest `lastfailed` cache JSON directly — instant, no collection overhead, and filters to `tests/`-prefixed entries to avoid stale junk paths. Also, - add `jq` tool permission for `.pytest_cache/` files (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code

Rework section 3 from a worktree-only check into a structured 3-step flow: detect active venv, interpret results (Case A: active, B: none, C: worktree), then run import + collection checks. Deats, - Case B prompts via `AskUserQuestion` when no venv is detected, offering `uv sync` or manual activate - add `uv run` fallback section for envs where venv activation isn't practical - new allowed-tools: `uv run python`, `uv run pytest`, `uv pip show`, `AskUserQuestion` (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code

Land the scaffolding for a future sub-interpreter (PEP 734 `concurrent.interpreters`) actor spawn backend per issue #379. The spawn flow itself is not yet implemented; `subint_proc()` raises a placeholder `NotImplementedError` pointing at the tracking issue — this commit only wires up the registry, the py-version gate, and the harness. Deats, - bump `pyproject.toml` `requires-python` to `>=3.12, <3.15` and list the `3.14` classifier — the new stdlib `concurrent.interpreters` module only ships on 3.14 - extend `SpawnMethodKey = Literal[..., 'subint']` - `try_set_start_method('subint')` grows a new `match` arm that feature-detects the stdlib module and raises `RuntimeError` with a clear banner on py<3.14 - `_methods` registers the new `subint_proc()` via the same bottom-of-module late-import pattern used for `._trio` / `._mp` Also, - new `tractor/spawn/_subint.py` — top-level `try: from concurrent import interpreters` guards `_has_subints: bool`; `subint_proc()` signature mirrors `trio_proc`/`mp_proc` so the Phase B.2 impl can drop in without touching the registry - re-add `import sys` to `_spawn.py` (needed for the py-version msg in the gate-error) - `_testing.pytest.pytest_configure` wraps `try_set_start_method()` in a `pytest.UsageError` handler so `--spawn-backend=subint` on py<3.14 prints a clean banner instead of a traceback (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code

Since we're devving subints we require the 3.14+ stdlib API and a couple compiled libs don't support it yet, namely: - `cffi`, which we're only using for the `.ipc._linux` eventfd stuff (now factored into `hotbaud` anyway). - `greenback`, which requires `greenlet` which doesn't seem to be wheeled yet * on nixos the sdist build was failing due to lack of `g++` which i don't care to figure out rn since we don't need `.devx` stuff immediately for this subints prototype. * [ ] we still need to adjust any dependent suites to skip. Adjust `test_ringbuf` to skip on import failure. Also project wide, - pin us to py 3.13+ in prep for last-2-minor-version policy. - drop `msgspec>=0.20.0`, the first release with py3.14 support.

Replace the B.1 scaffold stub w/ a working spawn flow driving PEP 734 sub-interpreters on dedicated OS threads. Deats, - use private `_interpreters` C mod (not the public `concurrent.interpreters` API) to get `'legacy'` subint config — avoids PEP 684 C-ext compat issues w/ `msgspec` and other deps missing the `Py_mod_multiple_interpreters` slot - bootstrap subint via code-string calling new `_actor_child_main()` from `_child.py` (shared entry for both CLI and subint backends) - drive subint lifetime on an OS thread using `trio.to_thread.run_sync(_interpreters.exec, ..)` - full supervision lifecycle mirrors `trio_proc`: `ipc_server.wait_for_peer()` → send `SpawnSpec` → yield `Portal` via `task_status.started()` - graceful shutdown awaits the subint's inner `trio.run()` completing; cancel path sends `portal.cancel_actor()` then waits for thread join before `_interpreters.destroy()` Also, - extract `_actor_child_main()` from `_child.py` `__main__` block as callable entry shape bc the subint needs it for code-string bootstrap - add `"subint"` to the `_runtime.py` spawn-method check so child accepts `SpawnSpec` over IPC Prompt-IO: ai/prompt-io/claude/20260417T124437Z_5cd6df5_prompt_io.md (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code

Expand the comment block above the `_interpreters` import explaining *why* we use the private C mod over `concurrent.interpreters`: the public API only exposes PEP 734's `'isolated'` config which breaks `msgspec` (missing PEP 684 slot). Add reference links to PEP 734, PEP 684, cpython sources, and the msgspec upstream tracker (jcrist/msgspec#563). Also, - update error msgs in both `_spawn.py` and `_subint.py` to say "3.13+" (matching the actual `_interpreters` availability) instead of "3.14+". - tweak the mod docstring to reflect py3.13+ availability via the private C module. Review: PR #444 (copilot-pull-request-reviewer) #444 (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code

`trio.to_thread.run_sync(_interpreters.exec, ...)` runs `exec()` on a cached worker thread — and when that thread is returned to the cache after the subint's `trio.run()` exits, CPython still keeps the subint's tstate attached to the (now idle) worker. Result: the teardown `_interpreters.destroy(interp_id)` in the `finally` block can block the parent's trio loop indefinitely, waiting for a tstate release that only happens when the worker either picks up a new job or exits. Manifested as intermittent mid-suite hangs under `--spawn-backend=subint` — caught by a `faulthandler.dump_traceback_later()` showing the main thread stuck in `_interpreters.destroy()` at `_subint.py:293` with only an idle trio-cache worker as the other live thread. Deats, - drive the subint on a plain `threading.Thread` (not `trio.to_thread`) so the OS thread truly exits after `_interpreters.exec()` returns, releasing tstate and unblocking destroy - signal `subint_exited.set()` back to the parent trio loop from the driver thread via `trio.from_thread.run_sync(..., trio_token=...)` — capture the token at `subint_proc` entry - swallow `trio.RunFinishedError` in that signal path for the case where parent trio has already exited (proc teardown) - in the teardown `finally`, off-load the sync `driver_thread.join()` to `trio.to_thread.run_sync` (cache thread w/ no subint tstate → safe) so we actually wait for the driver to exit before `_interpreters.destroy()` (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code

Log the `claude-opus-4-7` session that produced the `_subint.py` dedicated-thread fix (`26fb8206`). Substantive bc the patch was entirely AI-generated; raw log also preserves the CPython-internals research informing Phase B.3 hard-kill work. Prompt-IO: ai/prompt-io/claude/20260418T042526Z_26fb820_prompt_io.md (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code

Unbounded `trio.CancelScope(shield=True)` at the soft-kill and thread-join sites can wedge the parent trio loop indefinitely when a stuck subint ignores portal-cancel (e.g. bc the IPC channel is already broken). Deats, - add `_HARD_KILL_TIMEOUT` (3s) module-level const - wrap both shield sites with `trio.move_on_after()` so we abandon a stuck subint after the deadline - flip driver thread to `daemon=True` so proc-exit also isn't blocked by a wedged subint - pass `abandon_on_cancel=True` to `trio.to_thread.run_sync(driver_thread.join)` — load-bearing for `move_on_after` to actually fire - log warnings when either timeout triggers - improve `InterpreterError` log msg to explain the abandoned-thread scenario (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code

Bottle up the diagnostic primitives that actually cracked the silent mid-suite hangs in the `subint` spawn-backend bringup (issue there" session has them on the shelf instead of reinventing from scratch. Deats, - `dump_on_hang(seconds, *, path)` — context manager wrapping `faulthandler.dump_traceback_later()`. Critical gotcha baked in: dumps go to a *file*, not `sys.stderr`, bc pytest's stderr capture silently eats the output and you can spend an hour convinced you're looking at the wrong thing - `track_resource_deltas(label, *, writer)` — context manager logging per-block `(threading.active_count(), len(_interpreters.list_all()))` deltas; quickly rules out leak-accumulation theories when a suite progressively worsens (if counts don't grow, it's not a leak, look for a race on shared cleanup instead) - `resource_delta_fixture(*, autouse, writer)` — factory returning a `pytest` fixture wrapping `track_resource_deltas` per-test; opt in by importing into a `conftest.py`. Kept as a factory (not a bare fixture) so callers own `autouse` / `writer` wiring Also, - export the three names from `tractor.devx` - dep-free on py<3.13 (swallows `ImportError` for `_interpreters`) - link back to the provenance in the module docstring (issue #379 / commit `26fb820`) (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code

The private `_interpreters` C module ships since 3.13, but that vintage wedges under our `threading.Thread` + multi-trio usage pattern —> `_interpreters.exec()` silently never makes progress. 3.14 fixes it. So gate on the presence of the public `concurrent.interpreters` wrapper (3.14+ only) even tho we still call into the private module at runtime. Deats, - `try_set_start_method('subint')` error msg + `_subint` module docstring/comments rewritten to document the 3.14 floor and why 3.13 can't work. - `_subint._has_subints` gate now imports `concurrent.interpreters` (not `_interpreters`) as the version sentinel. Also, reshuffle `pyproject.toml` deps into per-python-version `[tool.uv.dependency-groups]`: - `subints` group: `msgspec>=0.21.0`, py>=3.14 - `eventfd` group: `cffi>=1.17.1`, py>=3.13,<3.14 - `sync_pause` group: `greenback`, py>=3.13,<3.14 (was in `devx`; moved out bc no 3.14 yet) Bump top-level `msgspec>=0.20.0` too. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code

Lock in the escape-hatch machinery added to `tractor.spawn._subint` during the Phase B.2/B.3 bringup (issue #379) so future stdlib regressions or our own refactors don't silently re-introduce the mid-suite hangs. Deats, - `test_subint_happy_teardown`: baseline — spawn a subactor, one portal RPC, clean teardown. If this breaks, something's wrong unrelated to the hard-kill shields. - `test_subint_non_checkpointing_child`: cancel a subactor stuck in a non-checkpointing Python loop (`threading.Event.wait()` releases the GIL but never inserts a trio checkpoint). Validates the bounded-shield + daemon-driver-thread combo abandons the thread after `_HARD_KILL_TIMEOUT`. Every test is wrapped in `trio.fail_after()` for a deterministic per-test wall-clock ceiling (an unbounded audit would defeat itself) and arms `tractor.devx.dump_on_hang()` so a hang captures a stack dump — pytest's stderr capture swallows `faulthandler` output by default. Gated via `pytest.importorskip('concurrent.interpreters')` and a module-level skip when `--spawn-backend` isn't `'subint'`. (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code

Classify and write up the two distinct hang modes hit during Phase B subint bringup (issue #379) so future triage doesn't re-derive them from scratch. Deats, two new `ai/conc-anal/` docs, - `subint_sigint_starvation_issue.md`: abandoned legacy-subint thread + shared GIL → main trio loop starves → signal-wakeup-fd pipe fills → `SIGINT` silently dropped (`strace` shows `write() = EAGAIN` on the wakeup-fd). Un- Ctrl-C-able. Structurally a CPython limit; blocked on `msgspec` PEP 684 (jcrist/msgspec#563) - `subint_cancel_delivery_hang_issue.md`: parent-side trio task parks on an orphaned IPC channel after subint teardown — no clean EOF delivered to the waiting receive. Ctrl-C-able (main loop iterates fine); OUR bug to fix. Candidate fix: explicit parent-side channel abort in `subint_proc`'s hard-kill teardown Cross-link the docs from their test reproducers, - `test_stale_entry_is_deleted` (→ starvation class): wrap `trio.run(main)` in `dump_on_hang(seconds=20)` so a future regression captures a stack dump. Kept un- skipped so the dump file is inspectable - `test_subint_non_checkpointing_child` (→ delivery class): extend docstring with a "KNOWN ISSUE" block pointing at the analysis (this patch was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code

Log the `claude-opus-4-7` collab that produced `e92e3cd2` ("Doc `subint` backend hang classes + arm `dump_on_hang`"). Substantive bc the two new `ai/conc-anal/` docs were jointly authored — user framed the two-class split + set candidate-fix ordering for the class-2 (Ctrl-C-able) hang; claude drafted the prose and the test-side cross-linking comments. `.raw.md` is in diff-ref mode — per-file pointers via `git diff e92e3cd~1..e92e3cd -- <path>` rather than re-embedding content that already lives in `git log -p`. Prompt-IO: ai/prompt-io/claude/20260420T192739Z_5e8cd8b2_prompt_io.md (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code

Add a hard process-level wall-clock bound on the two known-hanging subint-backend tests so an unattended suite run can't wedge indefinitely in either of the hang classes doc'd in `ai/conc-anal/`. Deats, - New `testing` dep: `pytest-timeout>=2.3`. - `test_stale_entry_is_deleted`: `@pytest.mark.timeout(3, method='thread')`. The `method='thread'` choice is deliberate — `method='signal'` routes via `SIGALRM` which is starved by the same GIL-hostage path that drops `SIGINT` (see `subint_sigint_starvation_issue.md`), so it'd never actually fire in the starvation case. - `test_subint_non_checkpointing_child`: same decorator, same reasoning (defense-in-depth over the inner `trio.fail_after(15)`). At timeout, `pytest-timeout` hard-kills the pytest process itself — that's the intended behavior here; the alternative is the suite never returning. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code

Add two more tests to the catalog in `conc-anal/subint_sigint_starvation_issue.md` — same signal-wakeup-fd-saturation fingerprint (abandoned legacy-subint driver threads → shared-GIL starvation → `write() = EAGAIN` on the wakeup pipe → silent SIGINT drop), different load patterns. Deats, - `test_cancel_while_childs_child_in_sync_sleep[subint-False]`: nested actor-tree + sync-sleeping grandchild. Under `trio`/`mp_*` the "zombie reaper" is a subproc `SIGKILL`; no equivalent exists under subint, so the grandchild persists in its abandoned driver thread. Often only manifests under full-suite runs (earlier tests seed the abandoned-thread pool). - `test_multierror_fast_nursery[subint-25-0.5]`: 25 concurrent subactors all go through teardown on the multierror. Bounded hard-kills run in parallel — so the total budget is ~3s, not 3s × 25. Leaves 25 abandoned driver threads simultaneously alive, an extreme pressure multiplier. `strace` shows several successful `write(16, "\2", 1) = 1` (GIL round-robin IS giving main brief slices) before finally saturating with `EAGAIN`. Also include a `pstree -snapt <pid>` capture showing 16+ live `{subint-driver[<interp_id>}` threads at the moment of hang — the direct GIL-contender population. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code

A reusable `@pytest.mark.skipon_spawn_backend( '<backend>' [, ...], reason='...')` marker for backend-specific known-hang / -borked cases — avoids scattering `@pytest.mark.skipif(lambda ...)` branches across tests that misbehave under a particular `--spawn-backend`. Deats, - `pytest_configure()` registers the marker via `addinivalue_line('markers', ...)`. - New `pytest_collection_modifyitems()` hook walks each collected item with `item.iter_markers( name='skipon_spawn_backend')`, checks whether the active `--spawn-backend` appears in `mark.args`, and if so injects a concrete `pytest.mark.skip( reason=...)`. `iter_markers()` makes the decorator work at function, class, or module (`pytestmark = [...]`) scope transparently. - First matching mark wins; default reason is `f'Borked on --spawn-backend={backend!r}'` if the caller doesn't supply one. Also, tighten type annotations on nearby `pytest` integration points — `pytest_configure`, `debug_mode`, `spawn_backend`, `tpt_protos`, `tpt_proto` — now taking typed `pytest.Config` / `pytest.FixtureRequest` params. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code

Adopt the `@pytest.mark.skipon_spawn_backend('subint', reason=...)` marker (a617b52) across the suites reproducing the `subint` GIL-contention / starvation hang classes doc'd in `ai/conc-anal/subint_*_issue.md`. Deats, - Module-level `pytestmark` on full-file-hanging suites: - `tests/test_cancellation.py` - `tests/test_inter_peer_cancellation.py` - `tests/test_pubsub.py` - `tests/test_shm.py` - Per-test decorator where only one test in the file hangs: - `tests/discovery/test_registrar.py ::test_stale_entry_is_deleted` — replaces the inline `if start_method == 'subint': pytest.skip` branch with a declarative skip. - `tests/test_subint_cancellation.py ::test_subint_non_checkpointing_child`. - A few per-test decorators are left commented-in- place as breadcrumbs for later finer-grained unskips. Also, some nearby tidying in the affected files: - Annotate loose fixture / test params (`pytest.FixtureRequest`, `str`, `tuple`, `bool`) in `tests/conftest.py`, `tests/devx/conftest.py`, and `tests/test_cancellation.py`. - Normalize `"""..."""` → `'''...'''` docstrings per repo convention on a few touched tests. - Add `timeout=6` / `timeout=10` to `@tractor_test(...)` on `test_cancel_infinite_streamer` and `test_some_cancels_all`. - Drop redundant `spawn_backend` param from `test_cancel_via_SIGINT`; use `start_method` in the `'mp' in ...` check instead. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code

goodboy · 2026-04-24T00:43:10Z

+#   https://peps.python.org/pep-0684/
+# - stdlib docs (3.14+):
+#   https://docs.python.org/3.14/library/concurrent.interpreters.html
+# - CPython public wrapper source (`Lib/concurrent/interpreters/`):


we're missing a link to the older pep-554 here?

goodboy · 2026-04-24T01:43:31Z

+from trio import TaskStatus
+
+
+# NOTE: we reach into the *private* `_interpreters` C module


Reason for pinning to the legacy API..

goodboy · 2026-04-24T01:45:46Z

@@ -0,0 +1,350 @@
+# `subint` backend: abandoned-subint thread can wedge main trio event loop (Ctrl-C unresponsive)


The full reason we need a GIL per subint..

goodboy added 26 commits April 20, 2026 16:08

Ignore notes & snippets subdirs in git

9a360fb

Bump xonsh to latest pre 0.23 release

73e83e1

Pin xonsh to GH main in editable mode

3773ad8

Avoid skip .ipc._ringbuf import when no cffi

1037c05

Bump lock-file for pytest-timeout + 3.13 gated wheel-deps

9bd1131

Add global 200s pytest-timeout

17f1ea5

Skip test_stale_entry_is_deleted hanger with subints

e2b1035

goodboy changed the title ~~A Sub-interpreter spawn backend~~ A Sub-interpreter-in-thread spawning backend Apr 22, 2026

goodboy changed the title ~~A Sub-interpreter-in-thread spawning backend~~ A subinterpreter-in-thread spawning backend Apr 22, 2026

Base automatically changed from spawn_modularize to main April 23, 2026 22:42

goodboy mentioned this pull request Apr 24, 2026

Port to new concurrent.interpreters support in cpython 3.14+ jcrist/msgspec#1026

Open

goodboy commented Apr 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A subinterpreter-in-thread spawning backend#446

A subinterpreter-in-thread spawning backend#446
goodboy wants to merge 26 commits intomainfrom
subint_spawner_backend

goodboy commented Apr 21, 2026

Uh oh!

goodboy Apr 24, 2026

Uh oh!

goodboy Apr 24, 2026

Uh oh!

goodboy Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		from trio import TaskStatus


		# NOTE: we reach into the private `_interpreters` C module

		@@ -0,0 +1,350 @@
		# `subint` backend: abandoned-subint thread can wedge main trio event loop (Ctrl-C unresponsive)

Conversation

goodboy commented Apr 21, 2026

Uh oh!

goodboy Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

goodboy Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

goodboy Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant