Add persistent Python REPL tool for Templates #687

Draft: datvo06 wants to merge 2 commits into master from dn-678-repl-tool

@datvo06 datvo06 commented Jun 11, 2026

Contributor

Closes #678.

Adds a generic code-execution tool that a Template's LLM can call to run Python before producing a final answer. It targets three desiderata:
(1) linked to the Template's lexical context;
(2) persistent state across tool calls;
(3) redirected output streams.

Example.

from effectful.handlers.llm import Template, LiteLLMProvider
from effectful.handlers.llm.completions import PythonRepl
from effectful.handlers.llm.evaluation import UnsafeEvalProvider
from effectful.ops.semantics import handler
from effectful.ops.types import NotHandled

readings = [12, 19, 23, 31, 8, 27]

@Template.define
def outlier_count() -> int:
    """Use the `exec_code` tool to compute how many values in `readings`
    lie more than one population stdev from the mean; return that count."""
    raise NotHandled

with handler(LiteLLMProvider()), handler(UnsafeEvalProvider()), handler(PythonRepl()):
    n = outlier_count()

The LLM runs code across rounds, and state persists between calls:

exec_code("m = ...; s = ...")   ->  ""        # readings came from lexical scope
exec_code("print(...using m, s...)")  ->  "2\n"   # m, s persisted from the prior call
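
The snippets in the transcript above are elided, so the following is only a sketch of what the two calls might compute, not the code the model actually sent. It uses the standard-library `statistics` module on the `readings` list from the example and arrives at the same count of 2.

```python
import statistics

readings = [12, 19, 23, 31, 8, 27]

# First call (assumed shape): bind the mean and the population stdev.
m = statistics.fmean(readings)   # 20.0
s = statistics.pstdev(readings)  # roughly 8.04

# Second call (assumed shape): reuse m and s from the previous call.
outliers = [x for x in readings if abs(x - m) > s]
print(len(outliers))  # prints 2 (the values 31 and 8)
```

Note that 12 sits exactly 8 away from the mean, just inside one standard deviation, so it is not counted.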

ReplSession (evaluation.py): a plain class seeded from a lexical context. Each run(source) executes one complete snippet in exec mode through the parse/compile/exec effect operations (so the installed eval-provider owns sandboxing), captures stdout/stderr into a per-call buffer, and persists bindings in self.locals across calls. Per-snippet filenames keep each cell's source in linecache so cross-snippet tracebacks format correctly; the traceback formatter trims the effect-machinery frames so the LLM sees only its own code.
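
A minimal sketch of the mechanics described above, assuming plain builtins instead of the `parse`/`compile`/`exec` effect operations. `MiniReplSession` and its filename scheme are hypothetical names for illustration, not the PR's implementation.

```python
import contextlib
import io
import linecache
import traceback


class MiniReplSession:
    """Hypothetical stand-in for ReplSession; calls builtins directly."""

    def __init__(self, context):
        self.locals = dict(context)  # seeded from the lexical context
        self._count = 0

    def run(self, source):
        self._count += 1
        # One pseudo-filename per snippet, unique per session.
        filename = f"<repl-{id(self):x}-{self._count}>"
        # Register the source so later tracebacks can show this cell's lines.
        lines = source.splitlines(keepends=True)
        linecache.cache[filename] = (len(source), None, lines, filename)
        buf = io.StringIO()  # fresh buffer for each call
        with contextlib.redirect_stdout(buf), contextlib.redirect_stderr(buf):
            try:
                exec(compile(source, filename, "exec"), self.locals)
            except Exception as exc:
                # Skip the first frame (this run() method) so only the
                # snippet's own frames are reported.
                tb = exc.__traceback__.tb_next
                buf.write("".join(traceback.format_exception(type(exc), exc, tb)))
        return buf.getvalue()


session = MiniReplSession({"readings": [12, 19, 23]})
session.run("total = sum(readings)")   # returns ""
print(session.run("print(total)"))     # prints 54
```

Because every snippet stays registered in `linecache`, a function defined in one call and failing in a later call still shows its source line in the traceback.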

PythonRepl (completions.py): a collect_tools handler exposing an exec_code Tool bound to a session that persists for the Template invocation. Off by default. Sessions are keyed by id(env) and pruned via weakref.finalize (no leak, no id-reuse), created lazily on first use.

The exec op docstring now states its binding-effects contract: after exec(bytecode, env), env reflects all top-level bindings — new and rebound alike.
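
For illustration, the builtin `exec` already behaves this way. The sketch below uses the builtins rather than the effectful `exec` op, whose exact signature is not shown here.

```python
env = {"x": 1}  # "x" plays the role of a seeded name

bytecode = compile("x = x + 1\ny = x * 10", "<snippet>", "exec")
exec(bytecode, env)

# env now reflects both the rebound name and the new one.
print(env["x"], env["y"])  # prints: 2 20
```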

Fenced to UnsafeEvalProvider for now (see #685): RestrictedEvalProvider drops rebindings of seeded names and lacks RestrictedPython's print wiring. The two restricted REPL tests are xfail(strict) against #685 and will flip to xpass when it lands (which strict mode reports as a failure, signaling that the markers should be removed).

Tests. 12 ReplSession laws and 7 PythonRepl handler tests, all deterministic under UnsafeEvalProvider:
- ReplSession: seed, persistence, rebind, multi-statement, print, syntax-error, exception isolation, KeyboardInterrupt-propagation, traceback trim, cross-snippet traceback, reentrancy, effect-routing.
- PythonRepl: off-by-default, exposes-tool, name-collision, composes-with-LexicalReaders, lazy creation, same-session-across-rounds, distinct-env.
Plus one replayed gpt-4o-mini integration test where the model genuinely uses exec_code across rounds to compute a statistic.

Decoupled from LexicalReaders by default; coupling REPL-created symbols into the readers is noted as a follow-up.

Closes #678.

Adds a generic code-execution tool a Template's LLM can call to run
Python before producing a final answer, with the three properties #678
asked for: linked to the Template's lexical context, persistent state
across tool calls, and redirected output streams.

- `ReplSession` (evaluation.py): a plain class seeded from a lexical
  context.  Each `run(source)` executes one complete snippet in exec
  mode through the `parse`/`compile`/`exec` effect operations (so the
  installed eval-provider owns sandboxing), captures stdout/stderr into
  a per-call buffer, and persists bindings in `self.locals` across
  calls.  Per-snippet filenames keep each cell's source in linecache so
  cross-snippet tracebacks format correctly; `_format_user_traceback`
  trims the effect-machinery frames so the LLM sees only its own code.
- `PythonRepl` (completions.py): a `collect_tools` handler exposing an
  `exec_code` Tool bound to a session that persists for the Template
  invocation.  Off by default.  Sessions are keyed by `id(env)` and
  pruned via `weakref.finalize`, created lazily on first use.
- The `exec` op docstring now states its binding-effects contract
  (after exec, env reflects all top-level bindings — new and rebound).

v1 requires `UnsafeEvalProvider`.  `RestrictedEvalProvider` drops
rebindings of seeded names and lacks RestrictedPython's print wiring
(#685); the two restricted REPL tests are xfail(strict) against #685
and will flip to xpass when it lands.

Tests: 12 `ReplSession` laws + 7 `PythonRepl` handler tests
(deterministic, under UnsafeEvalProvider) plus one replayed
gpt-4o-mini integration test where the model uses `exec_code` across
rounds to compute a statistic.
@datvo06 datvo06 marked this pull request as draft June 11, 2026 21:42
@datvo06

datvo06 commented Jun 11, 2026

Contributor Author

This doesn't close #685 yet as it stands. I'm still deciding how persistent the state between different calls really is (shared within one agent, or across all agents and all templates). There might also be a new problem with OpenAI trimming headers. Investigating.

@eb8680

eb8680 commented Jun 12, 2026

Contributor

how persistent the state between different calls really is (shared within one agent, or across all agents and all templates)

I think you only want it to persist for the duration of one Template call, not to share across multiple calls or multiple agents.

@datvo06

datvo06 commented Jun 12, 2026

Contributor Author

I see, that is simplifying! On it now.

@datvo06

datvo06 commented Jun 12, 2026

Contributor Author

Another question that has come up concerns nested templates:

@Template.define
def foo():
    """Some prompt"""
    raise NotHandled


@Template.define
def bar():
    """Some prompt"""
    raise NotHandled

Will the lexical context/envs then be shared between foo and bar if bar() calls exec() and then calls foo() as a tool? My current take is that they will be separate: bar() calling exec() will have no effect on foo() calling exec() further down the call stack.

@eb8680

eb8680 commented Jun 12, 2026

Contributor

Code should be understood to live in the lexical context of the relevant Template body. The body of bar can read the shared lexical context across foo, bar and whatever else is in the same scope, as well as any arguments to bar (none in this example), but any new variables are local to the body of bar and hence invisible to foo.
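
One way to picture this rule, as a sketch only: each Template body gets its own namespace seeded from the shared lexical context plus its arguments. `body_env` is a hypothetical helper, and the builtin `exec` stands in for the `exec_code` tool.

```python
# Lexical context visible to both foo and bar.
shared = {"readings": [12, 19, 23]}


def body_env(lexical_context, **arguments):
    # Hypothetical helper: a fresh namespace per Template body, seeded
    # from the shared scope and the body's own arguments.
    return {**lexical_context, **arguments}


bar_env = body_env(shared)
foo_env = body_env(shared)

# bar reads a shared name and creates a new binding.
exec("total = sum(readings)", bar_env)

print(bar_env["total"])      # prints 54
print("total" in foo_env)    # prints False: invisible to foo
print("total" in shared)     # prints False: shared scope untouched
```

The copy is shallow, so in-place mutation of a shared object such as `readings` would still be visible to both bodies; only name bindings are isolated.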

ReplSession subclasses the stdlib code.InteractiveInterpreter (compile routed through the parse/compile ops for linecache, runcode through the exec op, tracebacks trimmed to the user's own frames) instead of a plain class; PythonRepl._session_for drops its weakref guard. Adds tests for lexical-scope isolation (new bindings stay local to the body, invisible to siblings) and nested-env session isolation.
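
A rough sketch of the subclassing approach using only the standard library. The PR's version routes compilation and execution through the effect ops, which this illustration does not attempt; `CapturingInterpreter` and its `run` method are hypothetical names.

```python
import code
import contextlib
import io


class CapturingInterpreter(code.InteractiveInterpreter):
    """Hypothetical sketch; not the PR's ReplSession."""

    def __init__(self, context):
        super().__init__(dict(context))  # seeds self.locals
        self._buf = io.StringIO()

    def write(self, data):
        # InteractiveInterpreter reports tracebacks through write().
        self._buf.write(data)

    def run(self, source, filename="<cell>"):
        self._buf = io.StringIO()  # fresh buffer for each call
        with contextlib.redirect_stdout(self._buf), \
                contextlib.redirect_stderr(self._buf):
            try:
                compiled = compile(source, filename, "exec")
            except SyntaxError:
                self.showsyntaxerror(filename)
            else:
                # runcode() catches exceptions and calls showtraceback(),
                # which already omits the interpreter's own frame.
                self.runcode(compiled)
        return self._buf.getvalue()


interp = CapturingInterpreter({"readings": [12, 19, 23]})
interp.run("total = sum(readings)")
print(interp.run("print(total)"))  # prints 54
```

Unlike this sketch, `runcode` in the standard library also reports `KeyboardInterrupt` instead of propagating it, so a real session that needs propagation would have to override that behavior.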
Development

Successfully merging this pull request may close these issues.

Consider adding generic code execution tools
