Add persistent Python REPL tool for Templates #687
Closes #678.

Adds a generic code-execution tool a Template's LLM can call to run Python before producing a final answer, with the three properties #678 asked for: linked to the Template's lexical context, persistent state across tool calls, and redirected output streams.

- `ReplSession` (evaluation.py): a plain class seeded from a lexical context. Each `run(source)` executes one complete snippet in exec mode through the `parse`/`compile`/`exec` effect operations (so the installed eval-provider owns sandboxing), captures stdout/stderr into a per-call buffer, and persists bindings in `self.locals` across calls. Per-snippet filenames keep each cell's source in linecache so cross-snippet tracebacks format correctly; `_format_user_traceback` trims the effect-machinery frames so the LLM sees only its own code.
- `PythonRepl` (completions.py): a `collect_tools` handler exposing an `exec_code` Tool bound to a session that persists for the Template invocation. Off by default. Sessions are keyed by `id(env)` and pruned via `weakref.finalize`, created lazily on first use.
- The `exec` op docstring now states its binding-effects contract (after exec, env reflects all top-level bindings, both new and rebound).

v1 requires `UnsafeEvalProvider`. `RestrictedEvalProvider` drops rebindings of seeded names and lacks RestrictedPython's print wiring (#685); the two restricted REPL tests are xfail(strict) against #685 and will flip to xpass when it lands.

Tests: 12 `ReplSession` laws + 7 `PythonRepl` handler tests (deterministic, under UnsafeEvalProvider) plus one replayed gpt-4o-mini integration test where the model uses `exec_code` across rounds to compute a statistic.
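A minimal sketch of the session behaviour described above, assuming the builtin `compile`/`exec` in place of the `parse`/`compile`/`exec` effect operations (so no eval-provider or sandboxing is involved). `SketchReplSession` and the `<repl-N>` filename scheme are hypothetical names for illustration, not the PR's code:

```python
import contextlib
import io
import linecache
import traceback


class SketchReplSession:
    """Hypothetical stand-in for ReplSession, built on builtins only."""

    def __init__(self, context=None):
        # Seeded from the lexical context; bindings persist here across run() calls.
        self.locals = dict(context or {})
        self._counter = 0

    def run(self, source):
        self._counter += 1
        filename = f"<repl-{self._counter}>"
        # Register the snippet under its own filename so a later traceback can
        # show source lines from an earlier snippet. mtime=None keeps
        # linecache.checkcache() from evicting the entry.
        linecache.cache[filename] = (
            len(source), None, source.splitlines(keepends=True), filename
        )
        buf = io.StringIO()  # fresh per-call buffer
        with contextlib.redirect_stdout(buf), contextlib.redirect_stderr(buf):
            try:
                exec(compile(source, filename, "exec"), self.locals)
            except Exception:
                # KeyboardInterrupt is not an Exception, so it still propagates.
                traceback.print_exc(file=buf)
        return buf.getvalue()


session = SketchReplSession({"x": 2})
session.run("y = x * 3")                      # binds y, prints nothing
printed = session.run("print(y)")             # sees y from the previous call
session.run("def f():\n    return 1 / 0")     # defined in one snippet...
failure = session.run("f()")                  # ...and failing in a later one
```

Here `printed` holds the captured output of the second call, and `failure` holds a traceback that still shows the source line from the snippet where `f` was defined. Unlike the PR, this sketch does not trim its own `run` frame from the traceback.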
This doesn't close #685 yet as it stands. I'm still deciding how persistent the state between different calls really is (shared within one agent, or across all agents and all templates). There might also be a new problem with OpenAI trimming headers. Investigating.
I think you only want it to persist for the duration of one

I see, that is simplifying! On it now.
Another question that's coming up is what happens with nested templates:

    @Template.define
    def foo():
        """Some prompt"""
        raise NotHandled

    @Template.define
    def bar():
        """Some prompt"""
        raise NotHandled

Then will the lexical context/envs be shared between

Code should be understood to live in the lexical context of the relevant
ReplSession subclasses the stdlib code.InteractiveInterpreter (compile routed through the parse/compile ops for linecache, runcode through the exec op, tracebacks trimmed to the user's own frames) instead of a plain class; PythonRepl._session_for drops its weakref guard. Adds tests for lexical-scope isolation (new bindings stay local to the body, invisible to siblings) and nested-env session isolation.
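A rough sketch of what subclassing `code.InteractiveInterpreter` can look like, assuming plain builtin execution rather than the PR's effect operations. `SketchInterpreter` and its `run` method are hypothetical names; only `runsource`, `write`, and the `locals` attribute come from the stdlib class:

```python
import code
import contextlib
import io


class SketchInterpreter(code.InteractiveInterpreter):
    """Hypothetical subclass; not the PR's ReplSession."""

    def __init__(self, context=None):
        # The base class keeps the namespace in self.locals across calls.
        super().__init__(locals=dict(context or {}))
        self._buf = io.StringIO()

    def write(self, data):
        # The base class sends formatted tracebacks and syntax errors here.
        self._buf.write(data)

    def run(self, source):
        self._buf = io.StringIO()  # fresh buffer per call
        with contextlib.redirect_stdout(self._buf), \
                contextlib.redirect_stderr(self._buf):
            # symbol="exec" accepts multi-statement snippets.
            self.runsource(source, "<sketch>", "exec")
        return self._buf.getvalue()


interp = SketchInterpreter({"n": 1})
interp.run("n += 1")              # rebinds a seeded name
value = interp.run("print(n)")    # output captured from stdout
error = interp.run("1 / 0")       # traceback captured via write()
```

The stdlib `showtraceback` already drops the interpreter's own first frame, which is one reason the base class is a convenient starting point for showing the model only its own code.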
Closes #678.
Adds a generic code-execution tool a Template's LLM can call to run Python before producing a final answer. A few desiderata:
(1) linked to the Template's lexical context
(2) persistent state across tool calls
(3) redirected output streams.
Example.
The LLM runs code across rounds with state persisting:
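One way to picture those rounds, as a self-contained sketch: each string in `rounds` stands for the source the model would pass in one tool call. `make_exec_code` is a hypothetical stand-in for the `exec_code` tool and uses the builtin `exec`, not the PR's effect operations:

```python
import contextlib
import io


def make_exec_code(env):
    """Hypothetical factory: one namespace shared by every call it returns."""
    namespace = dict(env)

    def exec_code(source: str) -> str:
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(compile(source, "<round>", "exec"), namespace)
        return buf.getvalue()  # what the model would see as the tool result

    return exec_code


exec_code = make_exec_code({})
rounds = [
    "data = [2, 4, 4, 4, 5, 5, 7, 9]",   # round 1: load the data
    "mean = sum(data) / len(data)",      # round 2: reuse `data`
    "print(mean)",                       # round 3: reuse `mean`
]
outputs = [exec_code(src) for src in rounds]
# outputs == ["", "", "5.0\n"]
```

Only the last round prints anything, but it depends on bindings created in the first two, which is the persistence property being asked for.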
- `ReplSession` (evaluation.py): a plain class seeded from a lexical context. Each `run(source)` executes one complete snippet in exec mode through the `parse`/`compile`/`exec` effect operations (so the installed eval-provider owns sandboxing), captures stdout/stderr into a per-call buffer, and persists bindings in `self.locals` across calls. Per-snippet filenames keep each cell's source in `linecache` so cross-snippet tracebacks format correctly; the traceback formatter trims the effect-machinery frames so the LLM sees only its own code.
- `PythonRepl` (completions.py): a `collect_tools` handler exposing an `exec_code` Tool bound to a session that persists for the Template invocation. Off by default. Sessions are keyed by `id(env)` and pruned via `weakref.finalize` (no leak, no id-reuse), created lazily on first use.

The `exec` op docstring now states its binding-effects contract: after `exec(bytecode, env)`, `env` reflects all top-level bindings, new and rebound alike.

Fenced to `UnsafeEvalProvider` (#685). `RestrictedEvalProvider` drops rebindings of seeded names and lacks RestrictedPython's print wiring. The two restricted REPL tests are `xfail(strict)` against #685 and flip to xpass when it lands.

Tests. 12 `ReplSession` laws and 7 `PythonRepl` handler tests, deterministic under `UnsafeEvalProvider` (seed, persistence, rebind, multi-statement, print, syntax-error, exception isolation, KeyboardInterrupt-propagation, traceback trim, cross-snippet traceback, reentrancy, effect-routing; off-by-default, exposes-tool, name-collision, composes-with-LexicalReaders, lazy creation, same-session-across-rounds, distinct-env). Plus one replayed gpt-4o-mini integration test where the model genuinely uses `exec_code` across rounds to compute a statistic.

Decoupled from `LexicalReaders` by default; coupling REPL-created symbols into the readers is noted as a follow-up.
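The session bookkeeping mentioned above (keyed by `id(env)`, pruned via `weakref.finalize`, created lazily) could look roughly like this. It is a sketch under assumptions: `Env`, `_sessions`, and `session_for` are hypothetical names, the env is assumed to be weakly referenceable, and a session is reduced to a bare dict of bindings:

```python
import gc
import weakref


class Env(dict):
    """Hypothetical env type; a dict subclass so it supports weak references."""


_sessions = {}  # id(env) -> session namespace


def session_for(env):
    key = id(env)
    if key not in _sessions:
        # Created lazily on first use. Copying means the session holds no
        # strong reference to env, so env can still be collected.
        _sessions[key] = dict(env)
        # When env is collected, drop its entry so a later object that reuses
        # the same id() cannot pick up stale state.
        weakref.finalize(env, _sessions.pop, key, None)
    return _sessions[key]


env = Env(x=1)
first = session_for(env)
same = session_for(env) is first   # same env, same session
key = id(env)
del env
gc.collect()
pruned = key not in _sessions      # entry removed once env is gone
```

The finalizer is what makes keying by `id()` safe here: ids are only unique among live objects, so an entry must not outlive the env it was created for.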