diff --git a/SPEC.md b/SPEC.md index 342b657d..56c36679 100644 --- a/SPEC.md +++ b/SPEC.md @@ -257,6 +257,7 @@ Common shapes reached for from other languages. The parser and lexer surface eac | `- -*a b *c d` (double-minus) | `- 0 +*a b *c d` (negate the sum) | `ILO-P021` | | `[k fmt2 v 2]` (call in list) | `[k (fmt2 v 2)]` or bind-first | `ILO-P101` | | `pts=gen-pts;cs0=[...];prnt cs0` at top level | `main>_;pts=gen-pts;cs0=[...];prnt cs0` (wrap in `main>_;`) | `ILO-P102` | +| `((((...((1+1))))...))` 1000 deep | bind intermediates, or pass `--max-ast-depth N` | `ILO-P103` | Each case fires a hint pointing at the canonical form; the agent's first retry should be the right one. Identifier-shaped collisions with builtin names (`len=...`, `sin=...`) are rejected with `ILO-P011` plus a rename suggestion. @@ -266,6 +267,8 @@ The top-level chain trap (`ILO-P102`) catches a bare `name=expr` at the top leve The double-minus trap (`ILO-P021`) catches the silent-miscompile shape `- - a b c d` for `` in `{+,*,/}`. Read intuitively as `-(a*b) - (c*d)` but parses as `-((a*b) - (c*d)) = -(a*b) + (c*d)` because the inner `-` greedily consumes both prefix-binop groups as binary subtract and the outer `-` falls back to unary negate. Fix by negating the sum (`- 0 +*a b *c d`) or binding first (`p=*a b;q=*c d;- 0 +p q`). Single-atom variants like `- -a b` remain accepted since they're unambiguous. +The AST depth cap (`ILO-P103`) catches deeply nested source that would otherwise blow the parser stack. Any context that compiles untrusted text - `ilo serv`, the bare-positional dispatch, the `--ast` dump - is exposed to a payload of the shape `((((...((1+1))))...))` 1000 levels deep that recurses straight through the OS thread stack. The default cap of 256 is far above anything hand-written (the in-tree examples top out under 20) and low enough to keep the worst-case stack frame in `parse_atom`/`parse_expr` inside the default 8 MB main-thread stack. 
Override with `--max-ast-depth N` on `ilo`, `ilo run`, `ilo check`, `ilo build`, and `ilo serv` when a legitimate program needs deeper nesting. + --- ## Comments @@ -1712,6 +1715,8 @@ ilo program.ilo --ast -- print parsed AST as JSON and exit ilo --explain ILO-T004 -- print error explanation and exit ilo help ai -- compact AI spec to stdout (= contents of ai.txt) ilo serv -- long-lived JSON request/response loop +ilo --max-ast-depth N -- cap parser nesting at N (default 256; protects `ilo serv` + and other untrusted-source paths from DoS payloads, raises ILO-P103) ``` **Verb-noun aliases.** `ilo run ` is an exact alias for the bare positional `ilo ` - same dispatch, same engine selection, same arg handling. `ilo build -o ` is an alias for `ilo compile -o `. Both exist to match the toolchain conventions used by `cargo`, `go`, and `zero` so agents and humans can guess the command name without consulting the help text. The bare positional forms remain fully supported for backwards compatibility; nothing has been removed. diff --git a/ai.txt b/ai.txt index 97d90dba..23bfbd9f 100644 --- a/ai.txt +++ b/ai.txt @@ -2,7 +2,7 @@ INTRO: ilo is a token-optimised programming language for AI agents. Every design FILE VERSION PRAGMA: Optional. ^26.5 -- rest of file Top-of-file declaration of the minimum required runtime. First line, no leading whitespace. Sigil-led (principle 4), ~3 tokens (principle 1). First-class syntax, not a magic comment - the lexer recognises `^` only at file start, so `^` elsewhere keeps its `return err` meaning. Pragma absent=Assume latest installed runtime, no diagnostic File targets older than runtime, breaking change between=Fail with migration pointer File targets newer than runtime=Fail asking to upgrade Tooling: `ilo --version-of ` reads the pragma (returns nothing when absent); the formatter canonicalises position when present, never inserts one. Ships with the CalVer cut; 0.x files have no pragma and verify silently. 
FUNCTIONS: : ...>; No parens around params - `>` separates params from return type `;` separates statements - no newlines required Last expression is the return value (no `return` keyword) Zero-arg call: `make-id()` tot p:n q:n r:n>n;s=*p q;t=*s r;+s t TYPES: `n`=number (f64) `t`=text (string) `b`=bool `_`=any/unknown (wildcard type) `L n`=list of number `R n t`=result: ok=number, err=text `O n`=optional number (nil or n) `M t n`=map from text keys to numbers `S red green blue`=sum type - one of named text variants `F n t`=function type: takes n, returns t (used in HOF params) `order`=named type `a`=type variable - any single lowercase letter except n, t, b [Optional (`O T`)] `O T` accepts either `nil` or a value of type `T`. f x:O n>n;??x 0 -- unwrap optional or default to 0 g>O n;nil -- returns nil (valid O n) h>O n;42 -- returns 42 (valid O n) `??x default` - nil-coalesce: returns `x` if non-nil, else `default`. Unwraps `O T` to `T`. [Sum types (`S a b c`)] Closed set of named text variants. Verifier-enforced; runtime value is always `t`. color x:S red green blue > t ?x{red:"ff0000";green:"00ff00";blue:"0000ff"} Sum types are compatible with `t` - a sum value can be passed to any `t` parameter. [Map type (`M k v`)] Dynamic key-value collection. Keys are typed: text (`t`) or integer (`n`). `Int(1)` and `Text("1")` are distinct keys. mmap -- empty map mset m k v -- return new map with key k set to v mget m k -- value at key k, or nil mget-or m k default -- value at key k, or default if missing (never nil) mhas m k -- b: true if key exists mkeys m -- L t: sorted list of keys mvals m -- L v: values sorted by key mpairs m -- L (L _): sorted [k, v] pairs; mpairs m == zip (mkeys m) (mvals m) mdel m k -- return new map with key k removed len m -- number of entries Numeric keys work directly - no `str` conversion needed. Float keys floor to `i64` at the builtin boundary (matching `at xs i`); NaN/Infinity raise at runtime. 
idx=mmap idx=mset idx 7 "seven" -- M n t, integer key mget idx 7 -- "seven" mhas idx 7 -- true mhas idx "7" -- false (Int and Text are distinct) `jdmp` stringifies numeric keys for JSON output (JSON object keys are always strings). The round-trip via `jpar` is lossy - numeric keys come back as text. Example: scores>M t n m=mmap m=mset m "alice" 99 m=mset m "bob" 87 mget m "alice" -- 99 [Type variables] A single lowercase letter (other than `n`, `t`, `b`) in type position is a type variable, treated as `unknown` during verification. Used for higher-order function signatures: identity x:a>a;x apply f:F a a x:a>a;f x Type variables provide weak generics - the verifier accepts any type for `a` without consistency checking across call sites. [Inline lambdas] Pass a function literal directly to a HOF instead of defining a one-off top-level helper: by-dist xs:L n>L n;srt (x:n>n;abs x) xs nonempty ws:L t>L t;flt (s:t>b;>(len s) 0) ws sumsq xs:L n>n;fld (a:n x:n>n;+a *x x) xs 0 Syntax: `(: ...>;)`. Same shape as a top-level function declaration, wrapped in parens, no name. **Phase 1 (no captures)** lifts the literal to a synthetic top-level decl and works across every engine (tree, VM, Cranelift JIT, AOT). The body's free variables must all be params, locals defined inside the lambda body, or known top-level fns. **Phase 2 (closure capture)** lets the body reference variables from the enclosing scope: f xs:L n thr:n>L n;flt (x:n>b;>x thr) xs -- captures `thr` Phase 2 captures run natively on every engine: the tree interpreter, the register VM, the Cranelift JIT, and the Cranelift AOT backend. Each free variable is snapshot by value at the call site (`Expr::MakeClosure`) and appended to the call frame's arg slice on dispatch. The AOT backend additionally embeds the postcard-serialised `CompiledProgram` into the binary's `.rodata` and publishes TLS pointers on startup, so dispatch helpers can re-enter the VM on user-fn callbacks. 
The ctx-arg form (`srt fn ctx xs`) remains the cross-engine alternative when you want explicit state without forming a closure. -NAMING: Short names everywhere. 1–3 chars. `order`=`ord`=truncate `customers`=`cs`=consonants `data`=`d`=single letter `level`=`lv`=drop vowels `discount`=`dc`=initials `final`=`fin`=first 3 `items`=`its`=first 3 Function names follow the same rules. Field names in constructors and external tool names keep their full form - they define the public interface. [Identifier syntax] Identifiers are lowercase ASCII only, optionally with hyphenated segments. Formally: `[a-z][a-z0-9]*(-[a-z0-9]+)*`. Capital letters and underscores are rejected at the binding and call site. run -- OK run-d -- OK (hyphen separates segments) r2 -- OK (digit after first letter) runD -- ERROR (capital letter) RunD -- ERROR (leading capital) run_d -- ERROR (underscore not allowed in bindings) -run -- ERROR (must start with a letter) `runD` in the interactive CLI surfaces as `ILO-L003 unexpected token` with a suggestion to use `run-d` or `rund`. The constraint is intentional: a single lexical shape per identifier keeps the token stream predictable for agents and avoids style debates over camelCase vs snake_case vs kebab-case. The only place capital letters and underscores are accepted is **after `.` or `.?`** at field-access position, so heterogeneous JSON keys from real APIs work without rewriting. See [Field names at dot-access](#field-names-at-dot-access) for the full list of post-dot relaxations (`r.URL`, `r.AccessKey`, `r.user_name`, etc.). Binding names (`AccessKey = ...`) and function names (`AccessKey x:n>n;...`) still error. [Reserved words] The following identifiers are reserved and cannot be used as names: `if`, `return`, `let`, `fn`, `def`, `var`, `const`. Using them produces a friendly error with the ilo equivalent: -- ERROR: `if` is a reserved word. Use: ?cond{true:...;false:...} -- ERROR: `return` is a reserved word. Last expression is the return value. 
-- ERROR: `let` is a reserved word. Use: name = expr -- ERROR: `fn`/`def` is a reserved word. Use: name param:type > rettype; body Builtin names (`flat`, `frq`, `map`, `flt`, `cat`, `len`, `srt`, `hd`, `tl`, `ord`, `fld`, `lst`, ...) are also rejected as user-function names and as local-binding LHS. Without this, calls to the user fn or use sites of the local binding silently mis-dispatch to the builtin and surface as a confusing `ILO-T006` arity mismatch. The parser intercepts at the declaration site with ILO-P011 and a rename hint: flat n:n>n;n -- ERROR ILO-P011: `flat` is a builtin and cannot be used as a function name -- hint: rename to something like `myflat` or `flatof`. main>n;flat=cat xs " ";spl flat ". " -- ERROR ILO-P011: `flat` is a builtin and cannot be used as a binding name -- hint: rename to something like `myflat` or `flatv`. [Reserved namespaces] Short builtin names are precious surface and ilo reserves a stable subset of them. To save agents (and their carry-forward scripts) from "what got reserved this release?" debugging cycles, the language publishes the full short-name reserve list plus a forward-compatibility rule for future builtins. 
**Currently reserved short names (1-3 characters).** Every name in this list is a builtin today and triggers `ILO-P011` if used as a binding or user-function name: 1-char e 2-char at hd pi tl rd wr ct 3-char abs avg cap cat cel chr cos det dot env exp fft fld flr flt fmt frq get grp has inv len log lsd lst lwr map max min mod now num ord pow pst rdb rdl rev rgx rng rnd rou run sin slc spl srt str sum tan tau trm unq upr wra wrl zip All builtin aliases (`head`, `length`, `filter`, `concat`, `tail`, `sort`, `reverse`, `flatten`, `contains`, `group`, `average`, `print`, `trim`, `split`, `format`, `regex`, `read`, `readlines`, `readbuf`, `write`, `writelines`, `lset`, `floor`, `ceil`, `round`, `rand`, `random`, `rng`, `string`, `number`, `slice`, `unique`, `fold`) are reserved with the same shadow-prevention semantics as canonical builtin names. Binding an alias name or using it as a user-function name fires `ILO-P011` at parse time with the canonical form in the diagnostic, since the call-site rewrite to the canonical builtin silently bypasses any user binding of the same name. Previously only `rng` and `rand` had individual guards; as of 0.12.1 every alias in the table above is covered by a single `resolve_alias` check, so new aliases automatically inherit the protection when added to the table. Longer builtin names (`acos`, `asin`, `atan`, `flat`, `take`, `drop`, `mget`, `mset`, `mmap`, `prnt`, `mapr`, `solve`, `lstsq`, `clamp`, `cumsum`, `cprod`, `median`, `matmul`, `range`, `window`, `chunks`, `walk`, `glob`, `prod`, `fsize`, `mtime`, `isfile`, `isdir`, …) are also reserved and rejected by `ILO-P011`, but the short-name namespace above is where carry-forward scripts most often collide, so it gets explicit enumeration. **Forward-compatibility rule.** Future ilo releases add new builtins under names **4 characters or longer**. A 2-character name that is not on this list today is safe to use as a binding or function name and stays safe across releases. 
A 3-character name that is not on this list is _highly likely_ to stay safe but is not a hard promise - the 3-char surface is already dense, and a rare ergonomic win may justify an addition, called out in the changelog. This gives agents a deterministic safe-name strategy: **2 chars**: any unreserved 2-char name is permanently fine for bindings (`ce` for "category", `ix` for index, `mn` for "mean", `pq` for "priority queue", …). Names on the reserved list above never get removed. **3 chars**: prefer unreserved 3-char names where possible. If a future release reserves one, the migration is a 1-character rename plus a changelog entry. **4+ chars**: always safe. New builtins land here first; any short alias is added later only if the long name is unambiguous and the short doesn't shadow a plausible user binding. When a collision does happen, `ILO-P011` surfaces it at the binding site with a rename suggestion - never silently mis-dispatches at the call site (see the `flat=cat xs " "` example above). Combined with the reserve list, that turns every name-collision incident into a single-character rename instead of a debugging spiral. [Cross-language gotchas] Common shapes reached for from other languages. 
The parser and lexer surface each with a friendly hint: `AND a b`, `OR a b`, `NOT a`=`&a b`, `|a b`, `!a`=`ILO-L001` `=a b`=`<=a b`, `>=a b` (single token)=`ILO-P003` `f=fn x:n>n;+x 1` (lambda)=`(x:n>n;+x 1)` (parenthesised lambda)=`ILO-P009` `\x{+x 1}` (Haskell/Rust lambda)=`(x:n>n;+x 1)` (parenthesised lambda)=`ILO-L001` `main:>n;body`=`main>n;body` (no `:` before `>`)=`ILO-P003` Multi-line body without braces=`@k xs{body}`, `cond{body}` on one line=`ILO-P003` `cond{^"err"}` braced-cond=Braceless `cond ^"err"` for early return=hint only `- -*a b *c d` (double-minus)=`- 0 +*a b *c d` (negate the sum)=`ILO-P021` `[k fmt2 v 2]` (call in list)=`[k (fmt2 v 2)]` or bind-first=`ILO-P101` `pts=gen-pts;cs0=[...];prnt cs0` at top level=`main>_;pts=gen-pts;cs0=[...];prnt cs0` (wrap in `main>_;`)=`ILO-P102` Each case fires a hint pointing at the canonical form; the agent's first retry should be the right one. Identifier-shaped collisions with builtin names (`len=...`, `sin=...`) are rejected with `ILO-P011` plus a rename suggestion. The list-literal call trap (`ILO-P101`) catches the case where a variadic builtin (`fmt`, `fmt2`) appears bare inside `[...]`. Fixed-arity builtins (`str`, `at`, `map`, ...) auto-expand to a call as one element, but variadic ones can't (the parser doesn't know where their args end), so the bare form would silently fall through as multiple elements with the builtin name as an undefined Ref. Fix by wrapping the call in parens (`[k (fmt2 v 2)]`) or binding first. The top-level chain trap (`ILO-P102`) catches a bare `name=expr` at the top level. ilo requires every binding to live inside a function body; a top-level `pts=gen-pts;cs0=[[...]]; ...; prnt cs2` without a `main>_;` (or any) header used to either die on the `=` (a bare `ILO-P003`) or get slurped into a previous function's body and emit a wall of misleading `ILO-T005` cascades on the wrong line. 
`ILO-P102` collapses both shapes into a single diagnostic that names the offending binding and suggests the canonical `main>_;` wrapper. The double-minus trap (`ILO-P021`) catches the silent-miscompile shape `- - a b c d` for `` in `{+,*,/}`. Read intuitively as `-(a*b) - (c*d)` but parses as `-((a*b) - (c*d)) = -(a*b) + (c*d)` because the inner `-` greedily consumes both prefix-binop groups as binary subtract and the outer `-` falls back to unary negate. Fix by negating the sum (`- 0 +*a b *c d`) or binding first (`p=*a b;q=*c d;- 0 +p q`). Single-atom variants like `- -a b` remain accepted since they're unambiguous. +NAMING: Short names everywhere. 1–3 chars. `order`=`ord`=truncate `customers`=`cs`=consonants `data`=`d`=single letter `level`=`lv`=drop vowels `discount`=`dc`=initials `final`=`fin`=first 3 `items`=`its`=first 3 Function names follow the same rules. Field names in constructors and external tool names keep their full form - they define the public interface. [Identifier syntax] Identifiers are lowercase ASCII only, optionally with hyphenated segments. Formally: `[a-z][a-z0-9]*(-[a-z0-9]+)*`. Capital letters and underscores are rejected at the binding and call site. run -- OK run-d -- OK (hyphen separates segments) r2 -- OK (digit after first letter) runD -- ERROR (capital letter) RunD -- ERROR (leading capital) run_d -- ERROR (underscore not allowed in bindings) -run -- ERROR (must start with a letter) `runD` in the interactive CLI surfaces as `ILO-L003 unexpected token` with a suggestion to use `run-d` or `rund`. The constraint is intentional: a single lexical shape per identifier keeps the token stream predictable for agents and avoids style debates over camelCase vs snake_case vs kebab-case. The only place capital letters and underscores are accepted is **after `.` or `.?`** at field-access position, so heterogeneous JSON keys from real APIs work without rewriting. 
See [Field names at dot-access](#field-names-at-dot-access) for the full list of post-dot relaxations (`r.URL`, `r.AccessKey`, `r.user_name`, etc.). Binding names (`AccessKey = ...`) and function names (`AccessKey x:n>n;...`) still error. [Reserved words] The following identifiers are reserved and cannot be used as names: `if`, `return`, `let`, `fn`, `def`, `var`, `const`. Using them produces a friendly error with the ilo equivalent: -- ERROR: `if` is a reserved word. Use: ?cond{true:...;false:...} -- ERROR: `return` is a reserved word. Last expression is the return value. -- ERROR: `let` is a reserved word. Use: name = expr -- ERROR: `fn`/`def` is a reserved word. Use: name param:type > rettype; body Builtin names (`flat`, `frq`, `map`, `flt`, `cat`, `len`, `srt`, `hd`, `tl`, `ord`, `fld`, `lst`, ...) are also rejected as user-function names and as local-binding LHS. Without this, calls to the user fn or use sites of the local binding silently mis-dispatch to the builtin and surface as a confusing `ILO-T006` arity mismatch. The parser intercepts at the declaration site with ILO-P011 and a rename hint: flat n:n>n;n -- ERROR ILO-P011: `flat` is a builtin and cannot be used as a function name -- hint: rename to something like `myflat` or `flatof`. main>n;flat=cat xs " ";spl flat ". " -- ERROR ILO-P011: `flat` is a builtin and cannot be used as a binding name -- hint: rename to something like `myflat` or `flatv`. [Reserved namespaces] Short builtin names are precious surface and ilo reserves a stable subset of them. To save agents (and their carry-forward scripts) from "what got reserved this release?" debugging cycles, the language publishes the full short-name reserve list plus a forward-compatibility rule for future builtins. 
**Currently reserved short names (1-3 characters).** Every name in this list is a builtin today and triggers `ILO-P011` if used as a binding or user-function name: 1-char e 2-char at hd pi tl rd wr ct 3-char abs avg cap cat cel chr cos det dot env exp fft fld flr flt fmt frq get grp has inv len log lsd lst lwr map max min mod now num ord pow pst rdb rdl rev rgx rng rnd rou run sin slc spl srt str sum tan tau trm unq upr wra wrl zip All builtin aliases (`head`, `length`, `filter`, `concat`, `tail`, `sort`, `reverse`, `flatten`, `contains`, `group`, `average`, `print`, `trim`, `split`, `format`, `regex`, `read`, `readlines`, `readbuf`, `write`, `writelines`, `lset`, `floor`, `ceil`, `round`, `rand`, `random`, `rng`, `string`, `number`, `slice`, `unique`, `fold`) are reserved with the same shadow-prevention semantics as canonical builtin names. Binding an alias name or using it as a user-function name fires `ILO-P011` at parse time with the canonical form in the diagnostic, since the call-site rewrite to the canonical builtin silently bypasses any user binding of the same name. Previously only `rng` and `rand` had individual guards; as of 0.12.1 every alias in the table above is covered by a single `resolve_alias` check, so new aliases automatically inherit the protection when added to the table. Longer builtin names (`acos`, `asin`, `atan`, `flat`, `take`, `drop`, `mget`, `mset`, `mmap`, `prnt`, `mapr`, `solve`, `lstsq`, `clamp`, `cumsum`, `cprod`, `median`, `matmul`, `range`, `window`, `chunks`, `walk`, `glob`, `prod`, `fsize`, `mtime`, `isfile`, `isdir`, …) are also reserved and rejected by `ILO-P011`, but the short-name namespace above is where carry-forward scripts most often collide, so it gets explicit enumeration. **Forward-compatibility rule.** Future ilo releases add new builtins under names **4 characters or longer**. A 2-character name that is not on this list today is safe to use as a binding or function name and stays safe across releases. 
A 3-character name that is not on this list is _highly likely_ to stay safe but is not a hard promise - the 3-char surface is already dense, and a rare ergonomic win may justify an addition, called out in the changelog. This gives agents a deterministic safe-name strategy: **2 chars**: any unreserved 2-char name is permanently fine for bindings (`ce` for "category", `ix` for index, `mn` for "mean", `pq` for "priority queue", …). Names on the reserved list above never get removed. **3 chars**: prefer unreserved 3-char names where possible. If a future release reserves one, the migration is a 1-character rename plus a changelog entry. **4+ chars**: always safe. New builtins land here first; any short alias is added later only if the long name is unambiguous and the short doesn't shadow a plausible user binding. When a collision does happen, `ILO-P011` surfaces it at the binding site with a rename suggestion - never silently mis-dispatches at the call site (see the `flat=cat xs " "` example above). Combined with the reserve list, that turns every name-collision incident into a single-character rename instead of a debugging spiral. [Cross-language gotchas] Common shapes reached for from other languages. 
The parser and lexer surface each with a friendly hint: `AND a b`, `OR a b`, `NOT a`=`&a b`, `|a b`, `!a`=`ILO-L001` `=a b`=`<=a b`, `>=a b` (single token)=`ILO-P003` `f=fn x:n>n;+x 1` (lambda)=`(x:n>n;+x 1)` (parenthesised lambda)=`ILO-P009` `\x{+x 1}` (Haskell/Rust lambda)=`(x:n>n;+x 1)` (parenthesised lambda)=`ILO-L001` `main:>n;body`=`main>n;body` (no `:` before `>`)=`ILO-P003` Multi-line body without braces=`@k xs{body}`, `cond{body}` on one line=`ILO-P003` `cond{^"err"}` braced-cond=Braceless `cond ^"err"` for early return=hint only `- -*a b *c d` (double-minus)=`- 0 +*a b *c d` (negate the sum)=`ILO-P021` `[k fmt2 v 2]` (call in list)=`[k (fmt2 v 2)]` or bind-first=`ILO-P101` `pts=gen-pts;cs0=[...];prnt cs0` at top level=`main>_;pts=gen-pts;cs0=[...];prnt cs0` (wrap in `main>_;`)=`ILO-P102` `((((...((1+1))))...))` 1000 deep=bind intermediates, or pass `--max-ast-depth N`=`ILO-P103` Each case fires a hint pointing at the canonical form; the agent's first retry should be the right one. Identifier-shaped collisions with builtin names (`len=...`, `sin=...`) are rejected with `ILO-P011` plus a rename suggestion. The list-literal call trap (`ILO-P101`) catches the case where a variadic builtin (`fmt`, `fmt2`) appears bare inside `[...]`. Fixed-arity builtins (`str`, `at`, `map`, ...) auto-expand to a call as one element, but variadic ones can't (the parser doesn't know where their args end), so the bare form would silently fall through as multiple elements with the builtin name as an undefined Ref. Fix by wrapping the call in parens (`[k (fmt2 v 2)]`) or binding first. The top-level chain trap (`ILO-P102`) catches a bare `name=expr` at the top level. ilo requires every binding to live inside a function body; a top-level `pts=gen-pts;cs0=[[...]]; ...; prnt cs2` without a `main>_;` (or any) header used to either die on the `=` (a bare `ILO-P003`) or get slurped into a previous function's body and emit a wall of misleading `ILO-T005` cascades on the wrong line. 
`ILO-P102` collapses both shapes into a single diagnostic that names the offending binding and suggests the canonical `main>_;` wrapper. The double-minus trap (`ILO-P021`) catches the silent-miscompile shape `- - a b c d` for `` in `{+,*,/}`. Read intuitively as `-(a*b) - (c*d)` but parses as `-((a*b) - (c*d)) = -(a*b) + (c*d)` because the inner `-` greedily consumes both prefix-binop groups as binary subtract and the outer `-` falls back to unary negate. Fix by negating the sum (`- 0 +*a b *c d`) or binding first (`p=*a b;q=*c d;- 0 +p q`). Single-atom variants like `- -a b` remain accepted since they're unambiguous. The AST depth cap (`ILO-P103`) catches deeply nested source that would otherwise blow the parser stack. Any context that compiles untrusted text - `ilo serv`, the bare-positional dispatch, the `--ast` dump - is exposed to a payload of the shape `((((...((1+1))))...))` 1000 levels deep that recurses straight through the OS thread stack. The default cap of 256 is far above anything hand-written (the in-tree examples top out under 20) and low enough to keep the worst-case stack frame in `parse_atom`/`parse_expr` inside the default 8 MB main-thread stack. Override with `--max-ast-depth N` on `ilo`, `ilo run`, `ilo check`, `ilo build`, and `ilo serv` when a legitimate program needs deeper nesting. COMMENTS: -- full line comment +a b -- end of line comment -- no multi-line comments; use consecutive -- lines -- like this Single-line only. `--` to end of line. No multi-line comment syntax - newlines are a human display concern, not a language concern. An entire ilo program can be one line. Use consecutive `--` lines when humans need multi-line comments. Stripped at the lexer level before parsing - comments produce no AST nodes and cost zero runtime tokens. Generating `--` costs 1 LLM token, so comments are essentially free. **Gotcha:** `--x 1` is a comment, not "negate (x minus 1)". The lexer matches `--` greedily as a comment and eats the rest of the line. 
To negate a subtraction, use a space or bind first: -- DON'T: --x 1 (comment, not negate-subtract) -- DO: - -x 1 (space separates the two minus operators) -- DO: r=-x 1;-r (bind first) OPERATORS: Both prefix and infix notation are supported. **Prefix is preferred** - it is the token-optimal form that eliminates parentheses and produces denser code. Infix is available for readability when needed. [Binary] `+a b`=`a + b`=add / concat / list concat=`n`, `t`, `L` `+=a v`=append to list (returns new list, see [Append semantics](#append-semantics-+=))=`L` `-a b`=`a - b`=subtract=`n` `*a b`=`a * b`=multiply=`n` `/a b`=`a / b`=divide=`n` `=a b`=`a == b`=equal (prefix `=` is preferred; `==a b` also accepted)=any `!=a b`=`a != b`=not equal=any `>a b`=`a > b`=greater than=`n`, `t` `<a b`=`a < b`=less than=`n`, `t` `>=a b`=`a >= b`=greater or equal=`n`, `t` `<=a b`=`a <= b`=less or equal=`n`, `t` `&a b`=`a & b`=logical AND (short-circuit)=any (truthy) `|a b`=`a | b`=logical OR (short-circuit)=any (truthy) [Append semantics (`+=`)] `+=xs v` is **pure-shaped**, despite the imperative-looking syntax. It returns a new list with `v` appended and does **not** mutate `xs` in the caller's scope. It works in every position a value-producing expression works: -- 1. Rebind (canonical accumulator pattern) xs=[];@i 0..3{xs=+=xs i};xs -- [0, 1, 2] -- 2. Non-rebind assignment (xs preserved) xs=[1, 2, 3];ys=+=xs 99 -- xs is still [1, 2, 3]; ys is [1, 2, 3, 99] -- 3. Pipeline / argument position len +=xs 99 -- length of [xs..., 99] sum +=xs 99 -- sum of [xs..., 99] The rebind shape `xs = +=xs v` is the standard foreach-build accumulator. When the binding is RC=1 the engines mutate the underlying buffer in place (amortised O(1) per push) - but this is a behind-the-scenes optimisation. To any observer the operation is still functional: nothing outside the rebind sees the old `xs`. The non-rebind shape `ys = +=xs v` always allocates a fresh list and leaves `xs` untouched, so source aliases are safe. 
There is no separate `push` builtin. `+=` covers every use case and is shorter; adding an alias would mean two ways to spell the same operation, costing reasoning tokens and surface area. [Unary] `-x`=negate=`n` `!x`=logical NOT=any (truthy) [Special infix] `a??b`=nil-coalesce (if a is nil, return b)=any `a>>f`=pipe (desugar to `f(a)`)=any [Prefix nesting (no parens needed)] +*a b c -- (a * b) + c *a +b c -- a * (b + c) >=+x y 100 -- (x + y) >= 100 -*a b *c d -- (a * b) - (c * d) The outer prefix op binds the inner prefix subexpression as its **left** operand, regardless of operator precedence. With two same-precedence ops side by side this is easy to misread: */a b c -- (a/b) * c ← NOT (a*b)/c /*a b c -- (a*b) / c ← NOT (a/b)*c +-a b c -- (a-b) + c ← NOT (a+b)-c -+a b c -- (a+b) - c ← NOT (a-b)+c The runtime emits a `hint:` diagnostic when one of these four pairs appears at a prefix position, since the parse order disagrees with the natural left-to-right reading. To force the other grouping, swap the ops or bind the inner result first: -- Want (a*b)/c with a=6, b=2, c=3: r=*a b;/r c -- bind, then divide → 4 /*a b c -- equivalent, swapping the prefix-pair order [Infix precedence] Standard mathematical precedence (higher binds tighter): 6=`*` `/` 5=`+` `-` `+=` 4=`>` `<` `>=` `<=` 3=`=` `!=` 2=`&` 1=`|` Function application binds tighter than all infix operators: f a + b -- (f a) + b, NOT f(a + b) x * y + 1 -- (x * y) + 1 (x + y) * 2 -- parens override precedence Each nested prefix operator saves 2 tokens (no `(` `)` needed). Flat prefix like `+a b` saves 1 char vs `a + b`. Across 25 expression patterns, prefix notation saves **22% tokens** and **42% characters** vs infix. See [research/explorations/prefix-vs-infix/](research/explorations/prefix-vs-infix/) for the full benchmark. Disambiguation: `-` followed by one atom is unary negate, followed by two atoms is binary subtract. 
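The prefix-nesting rule described under OPERATORS (the outer prefix op takes the inner prefix subexpression as its left operand) can be illustrated with a small Python sketch. This is not ilo's parser: `tokenize`, `parse`, and `to_infix` are hypothetical names, and the sketch assumes all four arithmetic operators are strictly binary, ignoring unary negate, negative literals, comparisons, and call expansion.

```python
OPS = {"+", "-", "*", "/"}  # assumption: only binary arithmetic ops are modelled


def tokenize(src):
    """Split glued prefix operators from atoms: '+*a b c' -> ['+', '*', 'a', 'b', 'c']."""
    tokens, word = [], ""
    for ch in src:
        if ch in OPS or ch.isspace():
            if word:
                tokens.append(word)
                word = ""
            if ch in OPS:
                tokens.append(ch)
        else:
            word += ch
    if word:
        tokens.append(word)
    return tokens


def parse(tokens, pos=0):
    """Return (infix_string, next_pos). An operator consumes exactly two operands."""
    if pos >= len(tokens):
        raise ValueError("unexpected end of input")
    tok = tokens[pos]
    if tok in OPS:
        left, pos = parse(tokens, pos + 1)   # inner prefix expr becomes the LEFT operand
        right, pos = parse(tokens, pos)
        return f"({left} {tok} {right})", pos
    return tok, pos + 1                      # atom


def to_infix(src):
    tokens = tokenize(src)
    expr, pos = parse(tokens)
    if pos != len(tokens):
        raise ValueError(f"stray tokens after position {pos}")
    return expr


print(to_infix("+*a b c"))     # ((a * b) + c)
print(to_infix("*/a b c"))     # ((a / b) * c)
print(to_infix("-*a b *c d"))  # ((a * b) - (c * d))
```

The `*/a b c` case shows why the same-precedence pairs are easy to misread: the grouping follows token order, not a precedence table.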
[Operands] Operator operands are **atoms** (literals, refs, field access), **nested prefix operators**, or **known-arity function calls**. The prefix-binop operand parser dispatches to call parsing when the ident at the cursor is a known-arity user fn or builtin AND the next token can start another operand: wh >len q 0{body} -- parses as wh > (len q) 0 { body } +f g h -- if f is 1-arity: BinOp(+, Call(f, [g]), h) -lnx 5 lnx 3 -- BinOp(-, Call(lnx, [5]), Call(lnx, [3])) dbl 5 -- Negate(Call(dbl, [5])) - unary on a call This parallels the `??` precedent: `??x default` accepts a call expression on the value side. Applies to every prefix-binop family member - `+`, `-`, `*`, `/`, comparisons, `&`, `|`, `+=` - and to unary negate when the call consumes the only operand. The same expansion also applies to the then/else slots of the prefix-ternary family (`?=cond a b`, `?>cond a b`, …) and the `?h cond a b` keyword form, so `?h =a b sev sc "NONE"` parses `sev sc` as a nested call without parens or a bind-first. Bare locals that shadow a user fn name still resolve via `Ref` rather than expanding into a zero-arg call, so `&e f{...}` where `f` is a local still parses as the bool operator with two refs. When the call expansion isn't available (the ident is a local that shadows a fn name, or the call's arity doesn't fit the remaining tokens), bind the call result first: r=fac p;*n r -- bind, then operate - always unambiguous **Negative literals vs binary minus**: the lexer greedily includes a leading `-` into number tokens. `-1`, `-7`, `-0` are all number literals at fresh-expression positions. To subtract from zero at the start of a statement, use a space: `- 0 v` (Minus token, then `0`, then `v`). f v:n>n;-0 v -- WRONG: -0 is Number(-0.0); v is a stray token f v:n>n;- 0 v -- OK: binary subtract: 0 - v = -v The lexer splits a glued negative literal back into `Minus + Number` when the previous token is one of `;`, `\n`, `=`, `{`, `(`, or `-`. 
The `-` context covers the operand slot of an outer prefix-minus, so `- -0 a b` lexes as `-, -, 0, a, b` and parses as `Subtract(Subtract(0, a), b)` = `-a - b` rather than tripping `ILO-P020`. Negative literals after an Ident, `[`, or another prefix binop (`+`, `*`, `/`) stay glued so call args (`at xs -1`), list literals (`[-2 1 3]`), and binary operands (`+a -3`) read naturally. **Subtraction spacing convention**: for general subtraction at statement position, write `a - b` with spaces on **both** sides. `a -b` (glued, no space before the `-`) is not a binary subtract: the lexer packs `-b` into a negative-literal token because the previous token (`a`, an Ident) is one of the keep-glued contexts above. That's deliberate so call args and list elements read naturally, but it means `0 -1.5` is a parse error (`ILO-P001: expected declaration, got number `-1.5`` with a tailored hint pointing at this rule). For a bare negative value as an expression, wrap in parens: `(-1.5)`. STRING LITERALS: Text values are written in double quotes. Escape sequences: `\n`=newline (0x0A) `\t`=tab (0x09) `\r`=carriage return (0x0D) `\f`=form feed (0x0C, PDF page separator) `\b`=backspace (0x08) `\v`=vertical tab (0x0B) `\a`=bell (0x07) `\0`=null (0x00) `\"`=literal double quote `\\`=literal backslash `\/`=literal forward slash (JSON passthrough) Unknown escapes (e.g. `\z`) preserve the backslash + char verbatim. "hello\nworld" -- two-line string "col1\tcol2" -- tab-separated spl text "\n" -- split file content into lines spl pdf "\f" -- split pdftotext output into pages @@ -16,6 +16,6 @@ TOOLS (EXTERNAL CALLS): tool "" > timeou IMPORTS: Split programs across files with `use`: use "path/to/file.ilo" -- import all declarations use "path/to/file.ilo" [name1 name2] -- import only named declarations All imported declarations merge into a flat shared namespace - no qualification, no `mod::fn` syntax. The verifier catches name collisions. 
-- math.ilo dbl n:n>n; *n 2 half n:n>n; /n 2 -- main.ilo use "math.ilo" run n:n>n; dbl! half n [Rules] Path is relative to the importing file's directory Transitive: if `a.ilo` uses `b.ilo`, `b.ilo`'s declarations are visible to `main.ilo` when it uses `a.ilo` Circular imports are an error (`ILO-P018`) Scoped import with unknown name: `ILO-P019` `use` in inline code (no file context): `ILO-P017` [Error codes] `ILO-P017`=File not found or `use` in inline mode `ILO-P018`=Circular import detected `ILO-P019`=Name in `[...]` list not declared in the imported file ERROR HANDLING: `R ok err` return type. Call then match: get-user uid;?{^e:^+"Lookup failed: "e;~d:use d} Compensate/rollback inline: charge pid amt;?{^e:release rid;^+"Payment failed: "e;~cid:continue} [Auto-Unwrap `!`] `func! args` calls `func` and auto-unwraps the Result: if `~v` (Ok), returns `v`; if `^e` (Err), immediately returns `^e` from the enclosing function. inner x:n>R n t;~x outer x:n>R n t;d=inner! x;~d Equivalent to `r=inner x;?r{~v:v;^e:^e}` but in 1 token instead of 12. Rules: The called function must return `R` or `O` (else verifier error ILO-T025) The enclosing function must return `R` (or `O` for Optional callees) (else verifier error ILO-T026) `!` goes after the function name, before args: `get! url` not `get url!` Zero-arg: `fetch!()` [Panic-Unwrap `!!`] `func!! args` is symmetric in shape with `!`, but on the failure path it aborts the program with a runtime diagnostic and exit code 1 instead of propagating. There is no enclosing-return-type constraint, so persona code can use it from `main>t`, `main>n`, or any non-Result / non-Optional context. main>t;rdl!! "input.txt" -- read file, abort with diagnostic if missing main>n;v=num!! "42";v -- parse number, abort on parse error main>n;m=mset mmap "k" 7;mget!! m "k" -- get value or abort if key missing On `^e` (Err) the program writes `panic-unwrap: ` to stderr and exits 1. 
On `O nil` the program writes `panic-unwrap: expected value, got nil`. On `~v` (Ok) or non-nil Optional, the inner value is extracted, identical to `!`. Rules: The called function must return `R` or `O` (else verifier error ILO-T025) **No constraint on the enclosing function's return type** - this is the difference from `!` `!!` goes after the function name, before args: `rdl!! path` not `rdl path!!` Zero-arg: `fetch!!()` Use `!` when the caller wants to react to the Err (compensate, retry, log). Use `!!` when the failure is a programming or environmental error the caller has no way to recover from - typical in short scripts, glue code, and main entry points. PATTERNS (FOR LLM GENERATORS): [Bind-first pattern] Always bind complex expressions to variables before using them in operators. Operators only accept atoms and nested operators as operands - not function calls. -- DON'T: *n fac -n 1 (fac is an operand of *, not a call) -- DO: r=fac -n 1;*n r (bind call result, then use in operator) [Recursion template] >;;...;;combine 1. **Guard**: base case returns early - `<=n 1 1` (or `<=n 1{1}`) 2. **Bind**: bind recursive call results - `r=fac -n 1` 3. **Combine**: use bound results in final expression - `*n r` [Factorial] fac n:n>n;<=n 1 1;r=fac -n 1;*n r `<=n 1 1` - braceless guard: if n <= 1, return 1 `r=fac -n 1` - recursive call with prefix subtract as argument `*n r` - multiply n by result [Fibonacci] fib n:n>n;<=n 1 n;a=fib -n 1;b=fib -n 2;+a b `<=n 1 n` - braceless guard: return n for 0 and 1 `a=fib -n 1;b=fib -n 2` - two recursive calls, each with prefix arg `+a b` - add results [Multi-statement bodies] Semicolons separate statements. Last expression is the return value. f x:n>n;a=*x 2;b=+a 1;*b b -- (x*2 + 1)^2 Bodies may also be written across multiple newline-separated lines, indented under the signature. The parser stays inside the same function body while it sees an open bracket (`[`, `(`, `{`) or a pipe operator continuation. 
This makes long literals and multi-line conditional pipelines readable without semicolons: f x:n>n a=*x 2 b=+a 1 *b b g>L n [10, 20, 30, 40, 50, 60, 70, 80] Statement separation reverts to standard rules once brackets close. A blank line ends the current declaration. Windows CRLF (`\r\n`) is normalised to `\n` before lexing, so files edited on Windows parse identically to Unix-line-ending files. [Multi-function files] Functions in a file are separated by **newlines**. The parser strips all newlines, so the token stream is flat. After parsing each function body, the parser uses the next newline-delimited boundary to start the next declaration. A non-last function body's **final expression must not be a bare variable reference (`Ref`) or a function call**, because the parser greedily reads following tokens as additional call arguments. Safe endings prevent this: Binary operator=`+n 0`, `*x 1`=✓=fixed arity - no greedy loop Index access=`xs.0`, `rec.field`=✓=returns `Expr::Index`, not `Ref` Match block=`?v{…}`=✓=ends with `}` ForEach block=`@x xs{…}`=✓=ends with `}` Parenthesised expr=`(x>>f>>g)`=✓=ends with `)` Record constructor=`point x:1 y:2`=✓=parses as `Expr::Record`, not `Ref` Text/number literal=`"ok"`, `42`=✓=literal, not `Ref` Bare variable (`Ref`)=`n`, `result`=✗=greedy loop fires Bare function call=`len xs`, `f a`=✗=greedy loop fires The **last function in a file** can end with anything - greedy parsing stops at EOF. 
-- Non-last functions: end with a binary expression digs n:n>n;t=str n;l=len t;+l 0 -- +l 0 = l (binary, safe) clmp n:n lo:n hi:n>n;n hi hi;+n 0 -- +n 0 = n (binary, safe; `clamp` is a builtin) -- Last function: bare call is fine sz xs:L n>n;len xs -- EOF - greedy loop stops naturally To use a pipe chain in a non-last function, wrap it in parentheses: dbl-inc x:n>n;(x>>dbl>>inc) -- parens prevent >> from consuming next function's name inc-sq x:n>n;x>>inc>>sq -- last function - no parens needed [DO / DON'T] -- DON'T: fac n:n>n;<=n 1 1;*n fac -n 1 -- ↑ *n sees fac as an atom operand, not a call -- DO: fac n:n>n;<=n 1 1;r=fac -n 1;*n r -- ↑ bind-first: call result goes into r, then *n r works -- DON'T: +fac -n 1 fac -n 2 -- ↑ + takes two operands; fac is just an atom ref -- DO: a=fac -n 1;b=fac -n 2;+a b -- ↑ bind both calls, then combine -ERROR DIAGNOSTICS: ilo verifies programs before execution and reports errors with stable codes, source context, and suggestions. [Error codes] Every error has a stable `ILO-` code. The letter is the namespace - the phase that raised the diagnostic - so agents and tools can route on prefix without parsing the message. Numeric ranges are reserved per namespace with generous gaps, so future codes slot in cleanly and the contract is forward-compatible. `ILO-L000-099`=L=Lexer / tokenisation=active `ILO-P100-199`=P=Parser / syntax=active `ILO-N200-299`=N=Names / resolution=reserved `ILO-I300-399`=I=Imports=reserved `ILO-T400-499`=T=Types=active `ILO-V500-599`=V=Verifier (post-type checks)=reserved `ILO-R600-699`=R=Runtime=active `ILO-D700-799`=D=Deprecation warnings=reserved `ILO-E800-899`=E=Engine-specific limitations=reserved `ILO-S900-999`=S=Skill / spec system=reserved **Historical codes.** ilo shipped with flat numbering inside each namespace - `ILO-L001`, `ILO-P001`, `ILO-T001`, `ILO-R001`, `ILO-W001`, all starting at 001. Those codes remain valid forever. 
The hundreds-block allocation above applies to new codes from now on, and a cross-engine regression test asserts every emitted code lives in a documented range. **Reserved namespaces.** `N`, `I`, `V`, `D`, `E`, `S` carry no codes today. They are forward declarations so the first code in each category slots into its own range without conflicting with the active namespaces. `D` is earmarked for deprecation warnings: when a feature is scheduled for removal it emits an `ILO-D7xx` warning at compile time without failing the build. Use `--explain` to see a detailed explanation: ilo --explain ILO-T004 [Source context] Errors point at the relevant source location with a caret: error[ILO-T005]: undefined function 'foo' (called with 1 args) --> 1:9 1 | f x:n>n;foo x = note: in function 'f' = suggestion: did you mean 'f'? Parser, verifier, and runtime errors all show source spans. The verifier uses the enclosing statement span as the best available location for expression-level errors. [Suggestions] The verifier provides context-aware hints: **Did you mean?** - Levenshtein-based suggestions for undefined variables, functions, fields, and types **Type conversion** - suggests `str` for n→t, `num` for t→n **Missing arms** - lists uncovered match patterns with types **Arity** - shows expected parameter signature [Error output formats] --ansi / -a ANSI colour (default for TTY) --text / -t Plain text (no colour) --json / -j JSON (default for piped output) --no-hints / -nh Suppress idiomatic hints NO_COLOR=1 Disable colour (same as --text) JSON error output follows a structured schema with `severity`, `code`, `message`, `labels` (with spans), `notes`, and `suggestion` fields. Runtime errors raised from the Cranelift JIT (opt-in via `--jit`) populate `labels` with the source span of the failing operation, matching tree and VM behaviour. 
Span coverage threads through every JIT runtime helper (unwrap, panic-unwrap, list-get, slice, index, jpth, mget, record-field strict access, builtin dispatch, dynamic call); AOT-compiled binaries inherit the same coverage. Pre-v0.11.6 builds surfaced `{"labels":[]}` for these shapes - if you see an empty labels array on a runtime error, the binary is out of date. AOT binaries also install an async-signal-safe handler in `ilo_aot_init` that catches fatal signals (SIGSEGV, SIGBUS, SIGFPE, SIGILL, SIGABRT) and writes a single JSON line on stderr identifying the signal before the process terminates with the conventional 128+signo exit code. The diagnostic uses `ILO-R015` (AOT runtime fault). Without the handler, a hard fault inside compiled native code would leave the process with raw signal exit (e.g. 139 for SIGSEGV) and no diagnostic - agents driving ilo couldn't distinguish a clean non-zero exit from a hard fault. A SIGSEGV from an AOT binary is always a bug in ilo (codegen or runtime helper); file an issue with the source program and the JSON line.
[Top-level program output] For a program whose entry function returns a Result, the `~`/`^` wrapper is split across streams and exit codes so shell callers do not have to strip a prefix: `~v` (Ok)=`v` (bare)=-=0 `^e` (Err)=-=`^e`=1 any non-Result=`v`=-=0 In `--json` mode the value is always wrapped (`{"schemaVersion": 1, "ok": v}` / `{"schemaVersion": 1, "error": {...}}`) and emitted to stdout; exit codes match the plain-mode table. The `schemaVersion` field was added in 0.12.1 to every CLI `--json` envelope (`run`, `graph`, `--ast`, `serv`, `tools --json`, `spec --json`) so agents can route on a single field across every command. See `JSON_OUTPUT.md` for the full audit table. `Display` on `Value::Ok` / `Value::Err` still renders `~v` / `^e` in every other context (nested values, `prnt`, REPL prompts, error messages, debug output) - only the top-level program-return print path is split. The contract applies uniformly to in-process runners (`ilo prog.ilo`, `--vm`, `--jit`) and to AOT-compiled standalone binaries from `ilo compile`. Both strip the top-level `~`/`^` wrapper on stdout, route `^e` to stderr, and use the same exit codes - output is byte-for-byte identical across every backend. [Idiomatic hints] After successful execution, ilo scans the source for non-canonical forms and emits hints to stderr: hint: `==` → `=` saves 1 char (both mean equality in ilo) hint: `length` → `len` (canonical short form) Builtin alias hints appear at most once per program (the first long-form name found). In JSON mode, hints appear as `{"hints":["..."]}` on stderr. Suppress with `--no-hints` / `-nh`. [CLI invocation] ilo 'code' [args...] 
-- inline program; default-runs the entry function ilo program.ilo [func] [args] -- if `func` is omitted and the file declares exactly one function, that function runs automatically ilo run program.ilo [func] [a] -- verb form; same dispatch as the bare positional ilo check program.ilo [--json] [--strict] -- run the verifier without executing (exit 0 = clean; --strict treats warnings as exit-code errors) ilo build program.ilo -o out -- AOT compile to a standalone binary (alias for `compile`) ilo program.ilo --ast -- print parsed AST as JSON and exit ilo --explain ILO-T004 -- print error explanation and exit ilo help ai -- compact AI spec to stdout (= contents of ai.txt) ilo serv -- long-lived JSON request/response loop **Verb-noun aliases.** `ilo run ` is an exact alias for the bare positional `ilo ` - same dispatch, same engine selection, same arg handling. `ilo build -o ` is an alias for `ilo compile -o `. Both exist to match the toolchain conventions used by `cargo`, `go`, and `zero` so agents and humans can guess the command name without consulting the help text. The bare positional forms remain fully supported for backwards compatibility; nothing has been removed. **`ilo check`.** Standalone verifier invocation: lex, parse, resolve imports, and run the type verifier without proceeding to bytecode compilation or execution. Exit code 0 means the program is well-typed and verifier-clean; exit code 1 means at least one diagnostic was emitted on stderr. The output mode follows the global flags (`--json` for NDJSON diagnostics, `--text` for plain text, `--ansi` for coloured output; auto-detected when omitted - JSON when stderr is not a TTY, ANSI otherwise). `ilo check` works on both files and inline code; on a syntactically-broken input it still reports the parse error rather than crashing, which is important for editor and agent loops that may feed in half-written programs. 
**`ilo check --strict`.** Treats every warning-severity diagnostic (ILO-T032 bare `fmt`, ILO-T033 bare `mset` / `+=` / `mdel`, future warning codes) as a hard exit-code failure. The diagnostic stream itself is unchanged: warnings still emit with `severity: "warning"` in the JSON output, so editor integrations that route by severity stay correct. Only the exit code is elevated. CI harnesses that gate merges on `ilo check` should use `--strict` so warnings can't slip through silently; for interactive use, the default (warnings-are-advisory) is the right behaviour. **Default-run.** Inline programs (`ilo 'code'`) and single-function files run their entry function with the remaining CLI args; no explicit function name needed. Multi-function files auto-pick a function called `main` when no positional func arg is supplied. The same heuristic applies to the explicit engine flags - `--vm` and `--jit` both auto-pick `main` on multi-fn files, matching the default-engine behaviour. With no `main` declared, supply a function-name argument. **AOT entry-pick.** `ilo compile file.ilo -o out` (alias `ilo build`) follows the same entry-pick rules as the in-process engines: a single user-defined function is used directly; on multi-function files the entry is `main` if defined, otherwise the explicit positional `func` arg (`ilo compile file.ilo -o out run`); otherwise the compile fails with `ILO-E801` and exits 1 without writing a binary. AOT does not fall back to "first declared function" - that historical default produced binaries that called the wrong entry symbol and SIGSEGV'd at runtime. **Default engine.** The bytecode register VM is the default execution path. It supports every opcode (closures with Phase 2 capture, listview windows, fused len-of-filter, every modern shape), and avoids the JIT compile-and-bail cost paid by the pre-v0.11.9 Cranelift-first default whenever a program touched an opcode the JIT couldn't handle. 
Cranelift JIT is opt-in via `--jit`; on opt-in, the JIT runs hot numeric loops and falls back to the VM on bailout. Phase 2 captures run natively on every public backend - VM, JIT, and AOT (`ilo compile`); AOT embeds the postcard `CompiledProgram` blob into the binary's `.rodata` so dispatch helpers can re-enter the VM on user-fn callbacks the same way the in-process runners do. For long-running workloads where the JIT pays for itself, opt in explicitly; for most agent workloads the VM is the right default. **Tree-walker is internal-only.** The tree-walking interpreter is no longer user-selectable: `--run-tree` and its `--run` alias were removed from the public CLI in 0.12.1 (they now error with the unknown-flag guard). The interpreter stays in-tree as the dispatch target for HOF / regex / fmt-variadic / IO / sleep / ct / rsrt / closure-bind-ctx shapes the VM and Cranelift haven't lifted natively yet - the VM bails to it transparently for the ops listed by `is_tree_bridge_eligible` (`rgx`, `rgxall`, `rgxall1`, `rgxall-multi`, `rgxsub`, `fmt`, `fmt2`, `rd`, `rdb`, `rdjl`, `rdin`, `rdinl`, `sleep`, `lsd`, `walk`, `glob`, `dirname`, `basename`, `pathjoin`, `fsize`, `mtime`, `isfile`, `isdir`, `run`, `env-all`, `jkeys`, `ct` 2-arg and 3-arg, `rsrt` 2-arg and 3-arg, `dur-parse`, `dur-fmt`, and the closure-bind ctx variants of `map`/`flt`/`fld`/`srt`). Cross-engine parity for those shapes is pinned by `tests/regression_builtin_bridge.rs` and `tests/regression_tree_bridge_invariants.rs`. 0.13.0+ is on track for a hard drop once the bridge consumers are lifted natively and the shared runtime types (`Value`, `MapKey`, `RuntimeError`, math helpers) are extracted from `src/interpreter/` to a non-engine module. **Subcommand dispatch.** The first positional argument is interpreted as a function name when it has the shape of an ilo identifier - `[a-z][a-z0-9]*(-[a-z0-9]+)*` - so `ilo file.ilo list-orders` routes to the `list-orders` function. 
Args that don't match the ident shape (file paths like `/tmp/data.json`, numbers, sigils, bracketed lists, anything with a `.` or `/`) route to `main` (or the entry function) as a positional CLI arg instead. Trailing dashes (`foo-`), doubled dashes (`foo--bar`), and negative numbers (`-1`) are not idents and pass through as data. **Unknown `--flag` guard.** Any token in the positional tail matching the clean long-flag shape `--word` or `--word-with-dashes` that isn't a recognised flag is rejected upfront with `error: unrecognised flag '--'. Use 'ilo --help' for valid flags. To pass it as a literal arg, separate with '--' first.` and exit 1. This prevents `ilo main.ilo --engine tree` from silently consuming `--engine` as a positional arg (which used to surface as misleading `ILO-R012 no functions defined` or `ILO-R004 main: expected N args, got N+1`). To pass a hyphen-prefixed token through as literal data, place the `--` separator first: `ilo main.ilo -- --foo`. Anything after the first `--` is data. Tokens with `=` (`--key=val`), trailing or doubled dashes (`--foo-`, `--foo--bar`), and negative numbers (`-1`) are not clean flag shapes and pass through unchanged. **Text-typed params.** When the entry function declares a parameter of type `t`, the CLI passes the raw arg through without numeric coercion. `ilo 'f x:t>t;x' 42` returns the string `"42"`, not the number 42. **Exit codes.** A program returning `Value::Err` (or `^reason` from the entry function) exits with code 1 and prints the err payload on stderr. `~v` (Ok) and any non-Result return value exit 0. Verifier and parser errors exit 2. **List args from the CLI.** Comma-separated args become `L n` or `L t` automatically: `ilo 'f xs:L n>n;sum xs' 1,2,3`. +ERROR DIAGNOSTICS: ilo verifies programs before execution and reports errors with stable codes, source context, and suggestions. [Error codes] Every error has a stable `ILO-` code. 
The letter is the namespace - the phase that raised the diagnostic - so agents and tools can route on prefix without parsing the message. Numeric ranges are reserved per namespace with generous gaps, so future codes slot in cleanly and the contract is forward-compatible. `ILO-L000-099`=L=Lexer / tokenisation=active `ILO-P100-199`=P=Parser / syntax=active `ILO-N200-299`=N=Names / resolution=reserved `ILO-I300-399`=I=Imports=reserved `ILO-T400-499`=T=Types=active `ILO-V500-599`=V=Verifier (post-type checks)=reserved `ILO-R600-699`=R=Runtime=active `ILO-D700-799`=D=Deprecation warnings=reserved `ILO-E800-899`=E=Engine-specific limitations=reserved `ILO-S900-999`=S=Skill / spec system=reserved **Historical codes.** ilo shipped with flat numbering inside each namespace - `ILO-L001`, `ILO-P001`, `ILO-T001`, `ILO-R001`, `ILO-W001`, all starting at 001. Those codes remain valid forever. The hundreds-block allocation above applies to new codes from now on, and a cross-engine regression test asserts every emitted code lives in a documented range. **Reserved namespaces.** `N`, `I`, `V`, `D`, `E`, `S` carry no codes today. They are forward declarations so the first code in each category slots into its own range without conflicting with the active namespaces. `D` is earmarked for deprecation warnings: when a feature is scheduled for removal it emits an `ILO-D7xx` warning at compile time without failing the build. Use `--explain` to see a detailed explanation: ilo --explain ILO-T004 [Source context] Errors point at the relevant source location with a caret: error[ILO-T005]: undefined function 'foo' (called with 1 args) --> 1:9 1 | f x:n>n;foo x = note: in function 'f' = suggestion: did you mean 'f'? Parser, verifier, and runtime errors all show source spans. The verifier uses the enclosing statement span as the best available location for expression-level errors. 
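As an illustration of routing on the namespace letter, the sketch below parses the first line of a text-mode diagnostic and maps the code to its phase. It is a minimal sketch, not part of ilo: `parse_header` and `NAMESPACES` are hypothetical names, the table is copied from the namespace list above, and the regex assumes the header always has the `error[ILO-X000]: message` shape shown in the example (other severities are not modelled).

```python
import re

# Namespace letters and phases, copied from the [Error codes] table.
NAMESPACES = {
    "L": "Lexer / tokenisation",
    "P": "Parser / syntax",
    "N": "Names / resolution",
    "I": "Imports",
    "T": "Types",
    "V": "Verifier (post-type checks)",
    "R": "Runtime",
    "D": "Deprecation warnings",
    "E": "Engine-specific limitations",
    "S": "Skill / spec system",
}

# Assumed header shape: error[ILO-<letter><three digits>]: <message>
HEADER = re.compile(r"^error\[(ILO-([A-Z])\d{3})\]: (.*)$")


def parse_header(line):
    """Return (code, phase, message), or None when the line is not a header."""
    match = HEADER.match(line)
    if match is None:
        return None
    code, letter, message = match.groups()
    # Letters outside the documented table are reported rather than guessed.
    return code, NAMESPACES.get(letter, "unknown namespace"), message


if __name__ == "__main__":
    print(parse_header("error[ILO-T005]: undefined function 'foo' (called with 1 args)"))
```

A caller that already consumes `--json` output would read the `code` field directly instead of matching on text; the letter lookup stays the same.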
[Suggestions] The verifier provides context-aware hints: **Did you mean?** - Levenshtein-based suggestions for undefined variables, functions, fields, and types **Type conversion** - suggests `str` for n→t, `num` for t→n **Missing arms** - lists uncovered match patterns with types **Arity** - shows expected parameter signature [Error output formats] --ansi / -a ANSI colour (default for TTY) --text / -t Plain text (no colour) --json / -j JSON (default for piped output) --no-hints / -nh Suppress idiomatic hints NO_COLOR=1 Disable colour (same as --text) JSON error output follows a structured schema with `severity`, `code`, `message`, `labels` (with spans), `notes`, and `suggestion` fields. Runtime errors raised from the Cranelift JIT (opt-in via `--jit`) populate `labels` with the source span of the failing operation, matching tree and VM behaviour. Span coverage threads through every JIT runtime helper (unwrap, panic-unwrap, list-get, slice, index, jpth, mget, record-field strict access, builtin dispatch, dynamic call); AOT-compiled binaries inherit the same coverage. Pre-v0.11.6 builds surfaced `{"labels":[]}` for these shapes - if you see an empty labels array on a runtime error, the binary is out of date. AOT binaries also install an async-signal-safe handler in `ilo_aot_init` that catches fatal signals (SIGSEGV, SIGBUS, SIGFPE, SIGILL, SIGABRT) and writes a single JSON line on stderr identifying the signal before the process terminates with the conventional 128+signo exit code. The diagnostic uses `ILO-R015` (AOT runtime fault). Without the handler, a hard fault inside compiled native code would leave the process with raw signal exit (e.g. 139 for SIGSEGV) and no diagnostic — agents driving ilo couldn't distinguish a clean non-zero exit from a hard fault. A SIGSEGV from an AOT binary is always a bug in ilo (codegen or runtime helper); file an issue with the source program and the JSON line. 
[Top-level program output] For a program whose entry function returns a Result, the `~`/`^` wrapper is split across streams and exit codes so shell callers do not have to strip a prefix: `~v` (Ok)=`v` (bare)=-=0 `^e` (Err)=-=`^e`=1 any non-Result=`v`=-=0 In `--json` mode the value is always wrapped (`{"schemaVersion": 1, "ok": v}` / `{"schemaVersion": 1, "error": {...}}`) and emitted to stdout; exit codes match the plain-mode table. The `schemaVersion` field was added in 0.12.1 to every CLI `--json` envelope (`run`, `graph`, `--ast`, `serv`, `tools --json`, `spec --json`) so agents can route on a single field across every command. See `JSON_OUTPUT.md` for the full audit table. `Display` on `Value::Ok` / `Value::Err` still renders `~v` / `^e` in every other context (nested values, `prnt`, REPL prompts, error messages, debug output) - only the top-level program-return print path is split. The contract applies uniformly to in-process runners (`ilo prog.ilo`, `--vm`, `--jit`) and to AOT-compiled standalone binaries from `ilo compile`. Both strip the top-level `~`/`^` wrapper on stdout, route `^e` to stderr, and use the same exit codes - output is byte-for-byte identical across every backend.
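From the caller's side, the plain-mode contract above might be consumed roughly as follows. This is a hedged sketch, not an official client: `classify_run` and `RunOutcome` are hypothetical names, and the mapping assumes plain (non-`--json`) mode, exit 2 for verifier and parser errors as stated under **Exit codes**, and the conventional 128+signo code for fatal signals in AOT binaries.

```python
from typing import NamedTuple


class RunOutcome(NamedTuple):
    kind: str     # "ok", "err", "rejected", "fault", or "other"
    payload: str  # value text, error text, or a signal description


def classify_run(exit_code: int, stdout: str, stderr: str) -> RunOutcome:
    """Interpret one finished program run from its exit code and streams."""
    if exit_code == 0:
        # Ok or non-Result: the bare value is on stdout.
        return RunOutcome("ok", stdout.rstrip("\n"))
    if exit_code == 1:
        # Err from the entry function: the payload is on stderr.
        return RunOutcome("err", stderr.rstrip("\n"))
    if exit_code == 2:
        # Verifier or parser error: the program never ran.
        return RunOutcome("rejected", stderr.rstrip("\n"))
    if exit_code > 128:
        # Fatal signal, e.g. 139 = 128 + 11 (SIGSEGV).
        return RunOutcome("fault", f"signal {exit_code - 128}")
    return RunOutcome("other", stderr.rstrip("\n"))


if __name__ == "__main__":
    print(classify_run(0, "42\n", ""))
    print(classify_run(139, "", ""))
```

Exit code 1 is also used by other failure paths described in this document (the unknown-flag guard, `ILO-E801` at compile time, `ilo check` diagnostics), so a real caller would need to know which command it ran before treating 1 as a program-level Err.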
[Idiomatic hints] After successful execution, ilo scans the source for non-canonical forms and emits hints to stderr: hint: `==` → `=` saves 1 char (both mean equality in ilo) hint: `length` → `len` (canonical short form) Builtin alias hints appear at most once per program (the first long-form name found). In JSON mode, hints appear as `{"hints":["..."]}` on stderr. Suppress with `--no-hints` / `-nh`. [CLI invocation] ilo 'code' [args...] -- inline program; default-runs the entry function ilo program.ilo [func] [args] -- if `func` is omitted and the file declares exactly one function, that function runs automatically ilo run program.ilo [func] [a] -- verb form; same dispatch as the bare positional ilo check program.ilo [--json] [--strict] -- run the verifier without executing (exit 0 = clean; --strict treats warnings as exit-code errors) ilo build program.ilo -o out -- AOT compile to a standalone binary (alias for `compile`) ilo program.ilo --ast -- print parsed AST as JSON and exit ilo --explain ILO-T004 -- print error explanation and exit ilo help ai -- compact AI spec to stdout (= contents of ai.txt) ilo serv -- long-lived JSON request/response loop ilo --max-ast-depth N -- cap parser nesting at N (default 256; protects `ilo serv` and other untrusted-source paths from DoS payloads, raises ILO-P103) **Verb-noun aliases.** `ilo run ` is an exact alias for the bare positional `ilo ` - same dispatch, same engine selection, same arg handling. `ilo build -o ` is an alias for `ilo compile -o `. Both exist to match the toolchain conventions used by `cargo`, `go`, and `zero` so agents and humans can guess the command name without consulting the help text. The bare positional forms remain fully supported for backwards compatibility; nothing has been removed. **`ilo check`.** Standalone verifier invocation: lex, parse, resolve imports, and run the type verifier without proceeding to bytecode compilation or execution. 
Exit code 0 means the program is well-typed and verifier-clean; exit code 1 means at least one diagnostic was emitted on stderr. The output mode follows the global flags (`--json` for NDJSON diagnostics, `--text` for plain text, `--ansi` for coloured output; auto-detected when omitted - JSON when stderr is not a TTY, ANSI otherwise). `ilo check` works on both files and inline code; on a syntactically-broken input it still reports the parse error rather than crashing, which is important for editor and agent loops that may feed in half-written programs. **`ilo check --strict`.** Treats every warning-severity diagnostic (ILO-T032 bare `fmt`, ILO-T033 bare `mset` / `+=` / `mdel`, future warning codes) as a hard exit-code failure. The diagnostic stream itself is unchanged: warnings still emit with `severity: "warning"` in the JSON output, so editor integrations that route by severity stay correct. Only the exit code is elevated. CI harnesses that gate merges on `ilo check` should use `--strict` so warnings can't slip through silently; for interactive use, the default (warnings-are-advisory) is the right behaviour. **Default-run.** Inline programs (`ilo 'code'`) and single-function files run their entry function with the remaining CLI args; no explicit function name needed. Multi-function files auto-pick a function called `main` when no positional func arg is supplied. The same heuristic applies to the explicit engine flags - `--vm` and `--jit` both auto-pick `main` on multi-fn files, matching the default-engine behaviour. With no `main` declared, supply a function-name argument. 
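The default-run rules can be summarised as a small decision function. This is a sketch of the rules as worded above, not ilo's implementation: `pick_entry` is a hypothetical name, and raising when the requested function is not declared is an assumption, since the text does not say how that case is reported.

```python
def pick_entry(declared, requested=None):
    """Choose the entry function name from the file's declared functions.

    declared  -- function names in the file, in declaration order
    requested -- the positional func argument, or None when omitted
    """
    if requested is not None:
        # Assumption: an explicit name must match a declared function.
        if requested not in declared:
            raise LookupError(f"no function named {requested!r}")
        return requested
    if len(declared) == 1:
        # Single-function files run that function automatically.
        return declared[0]
    if "main" in declared:
        # Multi-function files auto-pick `main`.
        return "main"
    raise LookupError("no `main` declared: supply a function-name argument")


if __name__ == "__main__":
    print(pick_entry(["run"]))
    print(pick_entry(["helper", "main"]))
    print(pick_entry(["helper", "main"], requested="helper"))
```

There is deliberately no fall-back to the first declared function: an ambiguous file is an error rather than a guess.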
**AOT entry-pick.** `ilo compile file.ilo -o out` (alias `ilo build`) follows the same entry-pick rules as the in-process engines: a single user-defined function is used directly; on multi-function files the entry is `main` if defined, otherwise the explicit positional `func` arg (`ilo compile file.ilo -o out run`); otherwise the compile fails with `ILO-E801` and exits 1 without writing a binary. AOT does not fall back to "first declared function" - that historical default produced binaries that called the wrong entry symbol and SIGSEGV'd at runtime. **Default engine.** The bytecode register VM is the default execution path. It supports every opcode (closures with Phase 2 capture, listview windows, fused len-of-filter, every modern shape), and avoids the JIT compile-and-bail cost paid by the pre-v0.11.9 Cranelift-first default whenever a program touched an opcode the JIT couldn't handle. Cranelift JIT is opt-in via `--jit`; on opt-in, the JIT runs hot numeric loops and falls back to the VM on bailout. Phase 2 captures run natively on every public backend - VM, JIT, and AOT (`ilo compile`); AOT embeds the postcard `CompiledProgram` blob into the binary's `.rodata` so dispatch helpers can re-enter the VM on user-fn callbacks the same way the in-process runners do. For long-running workloads where the JIT pays for itself, opt in explicitly; for most agent workloads the VM is the right default. **Tree-walker is internal-only.** The tree-walking interpreter is no longer user-selectable: `--run-tree` and its `--run` alias were removed from the public CLI in 0.12.1 (they now error with the unknown-flag guard). 
The interpreter stays in-tree as the dispatch target for HOF / regex / fmt-variadic / IO / sleep / ct / rsrt / closure-bind-ctx shapes the VM and Cranelift haven't lifted natively yet - the VM bails to it transparently for the ops listed by `is_tree_bridge_eligible` (`rgx`, `rgxall`, `rgxall1`, `rgxall-multi`, `rgxsub`, `fmt`, `fmt2`, `rd`, `rdb`, `rdjl`, `rdin`, `rdinl`, `sleep`, `lsd`, `walk`, `glob`, `dirname`, `basename`, `pathjoin`, `fsize`, `mtime`, `isfile`, `isdir`, `run`, `env-all`, `jkeys`, `ct` 2-arg and 3-arg, `rsrt` 2-arg and 3-arg, `dur-parse`, `dur-fmt`, and the closure-bind ctx variants of `map`/`flt`/`fld`/`srt`). Cross-engine parity for those shapes is pinned by `tests/regression_builtin_bridge.rs` and `tests/regression_tree_bridge_invariants.rs`. 0.13.0+ is on track for a hard drop once the bridge consumers are lifted natively and the shared runtime types (`Value`, `MapKey`, `RuntimeError`, math helpers) are extracted from `src/interpreter/` to a non-engine module. **Subcommand dispatch.** The first positional argument is interpreted as a function name when it has the shape of an ilo identifier - `[a-z][a-z0-9]*(-[a-z0-9]+)*` - so `ilo file.ilo list-orders` routes to the `list-orders` function. Args that don't match the ident shape (file paths like `/tmp/data.json`, numbers, sigils, bracketed lists, anything with a `.` or `/`) route to `main` (or the entry function) as a positional CLI arg instead. Trailing dashes (`foo-`), doubled dashes (`foo--bar`), and negative numbers (`-1`) are not idents and pass through as data. **Unknown `--flag` guard.** Any token in the positional tail matching the clean long-flag shape `--word` or `--word-with-dashes` that isn't a recognised flag is rejected upfront with `error: unrecognised flag '--'. Use 'ilo --help' for valid flags. To pass it as a literal arg, separate with '--' first.` and exit 1. 
This prevents `ilo main.ilo --engine tree` from silently consuming `--engine` as a positional arg (which used to surface as misleading `ILO-R012 no functions defined` or `ILO-R004 main: expected N args, got N+1`). To pass a hyphen-prefixed token through as literal data, place the `--` separator first: `ilo main.ilo -- --foo`. Anything after the first `--` is data. Tokens with `=` (`--key=val`), trailing or doubled dashes (`--foo-`, `--foo--bar`), and negative numbers (`-1`) are not clean flag shapes and pass through unchanged. **Text-typed params.** When the entry function declares a parameter of type `t`, the CLI passes the raw arg through without numeric coercion. `ilo 'f x:t>t;x' 42` returns the string `"42"`, not the number 42. **Exit codes.** A program returning `Value::Err` (or `^reason` from the entry function) exits with code 1 and prints the err payload on stderr. `~v` (Ok) and any non-Result return value exit 0. Verifier and parser errors exit 2. **List args from the CLI.** Comma-separated args become `L n` or `L t` automatically: `ilo 'f xs:L n>n;sum xs' 1,2,3`. FORMATTER: Dense output is the default - newlines are for humans, not agents. No flag needed for dense format: ilo 'code' Dense wire format (default) ilo 'code' --dense / -d Same, explicit ilo 'code' --expanded / -e Expanded human format (for code review) [Dense format] Single line per declaration, minimal whitespace. Operators glue to first operand: cls sp:n>t;>=sp 1000{"gold"};>=sp 500{"silver"};"bronze" [Expanded format] Multi-line with 2-space indentation. Operators spaced from operands: cls sp:n > t >= sp 1000 { "gold" } >= sp 500 { "silver" } "bronze" Dense format is canonical - `dense(parse(dense(parse(src)))) == dense(parse(src))`. 
COMPLETE EXAMPLE: tool get-user"Retrieve user by ID" uid:t>R profile t timeout:5,retry:2 tool send-email"Send an email" to:t subject:t body:t>R _ t timeout:10,retry:1 type profile{id:t;name:t;email:t;verified:b} ntf uid:t msg:t>R _ t;get-user uid;?{^e:^+"Lookup failed: "e;~d:!d.verified{^"Email not verified"};send-email d.email "Notification" msg;?{^e:^+"Send failed: "e;~_:~_}} [Recursive Example] Factorial and Fibonacci as standalone functions: fac n:n>n;<=n 1 1;r=fac -n 1;*n r fib n:n>n;<=n 1 n;a=fib -n 1;b=fib -n 2;+a b diff --git a/examples/ast-depth-cap.ilo b/examples/ast-depth-cap.ilo new file mode 100644 index 00000000..61ef8e5d --- /dev/null +++ b/examples/ast-depth-cap.ilo @@ -0,0 +1,12 @@ +-- ast-depth-cap: ilo caps parser nesting at 256 by default. Untrusted source +-- (think `ilo serv`) can otherwise blow the parser stack with +-- `((((...((1+1))))...))`. Override with `--max-ast-depth N` on `ilo`, +-- `ilo run`, `ilo check`, `ilo build`, or `ilo serv` when a legitimate +-- program needs deeper nesting. Hand-written ilo rarely exceeds depth 10. + +shallow>n;((((1)))) + +main>n;shallow + +-- run: main +-- out: 1 diff --git a/skills/ilo/ilo-agent.md b/skills/ilo/ilo-agent.md index 30e86018..96343d86 100644 --- a/skills/ilo/ilo-agent.md +++ b/skills/ilo/ilo-agent.md @@ -44,6 +44,10 @@ AOT-compiled binaries (`ilo compile`) follow the same contract byte-for-byte. `ilo serv [--mcp m.json] [--tools http.json]` is a long-lived JSON request/response loop on stdin/stdout. Send `{"program":"fn p:n>n;*p 2","func":"fn","args":[21]}`, get `{"ok": 42}` or `{"error":{...}}`. Cuts process-spawn overhead to zero. +## AST depth cap + +Parser nesting is capped at 256 by default — guards `ilo serv` and any other context that compiles untrusted source against `((((...((1+1))))...))` DoS payloads that would otherwise blow the parser stack. Hand-written ilo rarely exceeds depth 10. 
Override with `--max-ast-depth N` on `ilo`, `ilo run`, `ilo check`, `ilo build`, or `ilo serv` when a real program needs more. Hitting the cap surfaces as `ILO-P103`. + ## Branching Failures / repair: `ilo-edit-loop`. Runnable patterns: `ilo-examples`. Tools: `ilo-tools`. Engine pick: `ilo-engines`. diff --git a/skills/ilo/ilo-errors.md b/skills/ilo/ilo-errors.md index 172aa75f..e6d3e73f 100644 --- a/skills/ilo/ilo-errors.md +++ b/skills/ilo/ilo-errors.md @@ -18,6 +18,7 @@ description: Use this when reading ILO-XXXX error codes or fixing failures. List - **P009 unparenthesised lambda** - wrap `(p:t>r;body)`. - **P020 incomplete function header** - header missing `>type;body`; finish it. - **P021 double-minus prefix-binop trap** - `- -*a b *c d` ambiguous. Use `- 0 +*a b *c d` or bind first. +- **P103 AST nesting depth exceeded** - parser refused source nesting more than 256 levels deep (DoS guard for `ilo serv`). Flatten by binding intermediates, or raise the cap with `--max-ast-depth N`. ## Type diff --git a/src/cli/args.rs b/src/cli/args.rs index 8008de27..e9a2ae2e 100644 --- a/src/cli/args.rs +++ b/src/cli/args.rs @@ -43,6 +43,14 @@ pub struct Global { /// Suppress idiomatic hints after execution. #[arg(long = "no-hints", short = 'n', global = true)] pub no_hints: bool, + + /// Cap on AST nesting depth. Applies to every subcommand that parses source + /// (`run`, `check`, `build`, `serv`). Default 256 — far above anything + /// hand-written, low enough to keep `ilo serv` safe from `((((...))))` + /// DoS payloads against the parser stack. Raise only if a legitimate + /// program needs deeper nesting. + #[arg(long = "max-ast-depth", global = true)] + pub max_ast_depth: Option, } #[derive(Subcommand, Debug)] @@ -874,6 +882,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; // In test environment stderr is typically not a TTY → should return Json. 
// We can't reliably test the TTY branch, but we can test that explicit_json @@ -896,6 +905,7 @@ mod tests { text: false, json: true, no_hints: false, + max_ast_depth: None, }; assert!(g.explicit_json()); assert_eq!(g.output_mode(), OutputMode::Json); @@ -908,6 +918,7 @@ mod tests { text: true, json: false, no_hints: false, + max_ast_depth: None, }; assert!(!g.explicit_json()); assert_eq!(g.output_mode(), OutputMode::Text); @@ -920,6 +931,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; assert!(!g.explicit_json()); assert_eq!(g.output_mode(), OutputMode::Ansi); diff --git a/src/diagnostic/registry.rs b/src/diagnostic/registry.rs index f06437d7..21fc0307 100644 --- a/src/diagnostic/registry.rs +++ b/src/diagnostic/registry.rs @@ -605,6 +605,30 @@ subtract and negate it explicitly: This diagnostic exists to catch a specific silent-miscompile shape; single-atom variants like `- -a b` (negate of subtract over atoms) are unambiguous and remain accepted. +"#, + }, + ErrorEntry { + code: "ILO-P103", + short: "AST nesting depth exceeded", + long: r#"## ILO-P103: AST nesting depth exceeded + +The parser refused a program whose expression or statement tree nests more +deeply than the configured cap (default 256). A deeply nested input is +almost always a denial-of-service payload aimed at `ilo serv` or any other +context that compiles untrusted source — `((((...((1 + 1))))...))` recurses +straight through the OS thread stack on a tree-walker parser, and pathological +verifier complexity follows from there. + +The default cap of 256 is far above anything hand-written: the deepest +expression in the in-tree examples is under 20 levels. 
If a legitimate program +genuinely needs more, raise the cap with `--max-ast-depth N` on `ilo`, +`ilo run`, `ilo check`, `ilo build`, or `ilo serv`: + + ilo --max-ast-depth 1024 run prog.ilo + ilo serv --max-ast-depth 1024 + +**Fix:** flatten the expression by binding intermediates, or override the cap +deliberately if the depth is real. "#, }, // ── Type / Verifier ────────────────────────────────────────────────────── diff --git a/src/main.rs b/src/main.rs index 17708174..8f2120c2 100644 --- a/src/main.rs +++ b/src/main.rs @@ -2428,7 +2428,52 @@ fn load_dotenv() { fn main() { load_dotenv(); - let raw_args: Vec<String> = std::env::args().collect(); + let mut raw_args: Vec<String> = std::env::args().collect(); + + // `--max-ast-depth N` is a global flag (see ILO-P103). Strip it from + // `raw_args` here, before either the clap or the bare-positional dispatch + // sees the value, and install it on the parser as a process-wide override. + // Threading the cap through every `parser::parse` call site would touch 30+ + // sites for no behavioural win — every parse in the process is started + // from the same `fn main`, so a single atomic is enough. + let mut i = 1; + while i < raw_args.len() { + // Stop scanning at `--` so a literal positional `--max-ast-depth` arg + // to a user program isn't intercepted.
+ if raw_args[i] == "--" { + break; + } + if raw_args[i] == "--max-ast-depth" { + if i + 1 >= raw_args.len() { + eprintln!("error: --max-ast-depth requires a value"); + std::process::exit(1); + } + match raw_args[i + 1].parse::<usize>() { + Ok(n) if n >= 1 => parser::set_max_ast_depth_override(n), + _ => { + eprintln!( + "error: --max-ast-depth requires a positive integer, got '{}'", + raw_args[i + 1] + ); + std::process::exit(1); + } + } + raw_args.drain(i..i + 2); + continue; + } + if let Some(rest) = raw_args[i].strip_prefix("--max-ast-depth=") { + match rest.parse::<usize>() { + Ok(n) if n >= 1 => parser::set_max_ast_depth_override(n), + _ => { + eprintln!("error: --max-ast-depth requires a positive integer, got '{rest}'"); + std::process::exit(1); + } + } + raw_args.remove(i); + continue; + } + i += 1; + } // Global deprecation nudge: if any arg is the old `--run-vm` spelling, // emit the one-shot hint here so it fires uniformly across every @@ -2503,6 +2548,7 @@ fn main() { text: false, json: false, no_hints: false, + max_ast_depth: None, }, args: raw_args, }, @@ -7847,6 +7893,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }, args: vec!["f>n;1".to_string()], }; @@ -7864,6 +7911,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; let code = dispatch_bare_args(vec!["ilo".to_string(), "-ai".to_string()], &global); assert_eq!(code, 0); @@ -7878,6 +7926,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; let code = dispatch_bare_args( vec!["ilo".to_string(), "help".to_string(), "lang".to_string()], @@ -7893,6 +7942,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; let code = dispatch_bare_args( vec!["ilo".to_string(), "help".to_string(), "ai".to_string()], @@ -7908,6 +7958,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; let code = dispatch_bare_args(vec!["ilo".to_string(), "-h".to_string()], &global);
assert_eq!(code, 0); @@ -7922,6 +7973,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; // ILO-T001 is a known error code let code = dispatch_bare_args( @@ -7942,6 +7994,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; let code = dispatch_bare_args( vec![ @@ -7961,6 +8014,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; // --explain without a code argument → error exit let code = dispatch_bare_args(vec!["ilo".to_string(), "--explain".to_string()], &global); @@ -7976,6 +8030,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; let code = dispatch_bare_args(vec!["ilo".to_string(), "--version".to_string()], &global); assert_eq!(code, 0); @@ -7988,6 +8043,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; let code = dispatch_bare_args(vec!["ilo".to_string(), "-V".to_string()], &global); assert_eq!(code, 0); @@ -8000,6 +8056,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; let code = dispatch_bare_args(vec!["ilo".to_string(), "-v".to_string()], &global); assert_eq!(code, 0); @@ -8014,6 +8071,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; let code = dispatch_bare_args( vec![ @@ -8033,6 +8091,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; let code = dispatch_bare_args( vec![ @@ -8061,6 +8120,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; let code = dispatch_bare_args( vec!["ilo".to_string(), "-e".to_string(), "f>n;42".to_string()], @@ -8076,6 +8136,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; // -e with empty code string should fail let code = dispatch_bare_args( @@ -8094,6 +8155,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; // bench mode: requires a func name in 
rest let code = dispatch_bare_args( @@ -8118,6 +8180,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; let code = dispatch_bare_args( vec![ @@ -8138,6 +8201,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; let code = dispatch_bare_args( vec![ @@ -8162,6 +8226,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; let code = dispatch_bare_args( vec![ @@ -8183,6 +8248,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; let code = dispatch_bare_args( vec![ @@ -8202,6 +8268,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; let code = dispatch_bare_args( vec![ @@ -8223,6 +8290,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; let code = dispatch_bare_args( vec![ @@ -8242,6 +8310,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; let code = dispatch_bare_args( vec![ @@ -8263,6 +8332,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; let code = dispatch_bare_args( vec![ @@ -8286,6 +8356,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; let code = dispatch_bare_args( vec![ @@ -8307,6 +8378,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; let code = dispatch_bare_args( vec![ @@ -8330,6 +8402,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; // Just runs a simple program; tests that global.ansi overrides detected mode let code = dispatch_bare_args(vec!["ilo".to_string(), "f>n;42".to_string()], &global); @@ -8343,6 +8416,7 @@ mod tests { text: true, json: false, no_hints: false, + max_ast_depth: None, }; let code = dispatch_bare_args(vec!["ilo".to_string(), "f>n;42".to_string()], &global); assert_eq!(code, 0); @@ -8355,6 +8429,7 @@ mod tests { text: false, json: true, no_hints: false, + 
max_ast_depth: None, }; let code = dispatch_bare_args(vec!["ilo".to_string(), "f>n;42".to_string()], &global); assert_eq!(code, 0); @@ -9134,6 +9209,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }, args: vec![], }; @@ -9150,6 +9226,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }, args: vec![], }; @@ -9169,6 +9246,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }, args: vec![], }; @@ -9187,6 +9265,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }, args: vec![], }; @@ -9213,6 +9292,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }, args: vec![], }; @@ -9555,6 +9635,7 @@ mod tests { text: false, json: false, no_hints: false, + max_ast_depth: None, }; // rest has first arg = "double" which matches a function name let code = dispatch_bare_args( diff --git a/src/parser/mod.rs b/src/parser/mod.rs index 6286fc5c..d4fb9608 100644 --- a/src/parser/mod.rs +++ b/src/parser/mod.rs @@ -3,9 +3,51 @@ use crate::builtins::Builtin; use crate::lexer::Token; use std::collections::HashMap; +/// Default cap on AST nesting depth. Borrowed from Zero (rocicorp/mono#6000) +/// after the same "untrusted source can blow the parser stack" attack surface +/// surfaced for ilo: `ilo serv` and the bare-positional dispatch both compile +/// arbitrary text, and a 1 MB blob of `((((...((1+1))))...))` will recurse +/// straight through the OS thread stack on tree-walker parsers. +/// +/// 256 is far above anything a human or agent writes by hand (the deepest +/// expression in the in-tree examples is under 20) and small enough that even +/// the worst-case stack frame in `parse_atom`/`parse_expr` stays inside the +/// default 8 MB main-thread stack with plenty of headroom. Override via +/// `--max-ast-depth N` on `ilo`, `ilo run`, `ilo check`, `ilo build`, and +/// `ilo serv`. 
+pub const DEFAULT_MAX_AST_DEPTH: usize = 256; + +/// Process-wide override for the AST-depth cap, set once by CLI entry points +/// when `--max-ast-depth N` is parsed. A `0` value means "use the default". +/// Every call into `parser::parse` (or `Parser::new`) reads this so dozens of +/// internal call sites don't have to thread the value through. Tests can clear +/// it back to 0 if they care; in practice it's only written by `fn main`. +static MAX_AST_DEPTH_OVERRIDE: std::sync::atomic::AtomicUsize = + std::sync::atomic::AtomicUsize::new(0); + +/// Install a process-wide AST-depth cap. Called by CLI entry points after +/// parsing `--max-ast-depth N`. Subsequent calls to `parser::parse` / +/// `Parser::new` pick this up automatically. +pub fn set_max_ast_depth_override(cap: usize) { + MAX_AST_DEPTH_OVERRIDE.store(cap, std::sync::atomic::Ordering::Relaxed); +} + +fn effective_max_ast_depth() -> usize { + let v = MAX_AST_DEPTH_OVERRIDE.load(std::sync::atomic::Ordering::Relaxed); + if v == 0 { DEFAULT_MAX_AST_DEPTH } else { v } +} + pub struct Parser { tokens: Vec<(Token, Span)>, pos: usize, + /// Current nesting depth across recursive parse helpers. Incremented at + /// the entry of `parse_expr`, `parse_stmt`, `parse_decl`, `parse_atom`, + /// `parse_pattern`, and `parse_type` via `DepthGuard`. When `depth >= + /// max_depth` the next entry returns `ILO-P103` instead of recursing. + depth: usize, + /// Cap on `depth`. Default `DEFAULT_MAX_AST_DEPTH`; overridable from the + /// CLI for both `ilo` and `ilo serv` via `--max-ast-depth`. + max_depth: usize, /// Parallel to `tokens` with length `tokens.len() + 1`. 
Entry `i` is /// `Some(span)` iff at least one unindented `Token::Newline` (a top-level /// declaration boundary, as produced by `lexer::normalize_newlines`) sat @@ -51,6 +93,12 @@ type Result = std::result::Result; impl Parser { pub fn new(tokens: Vec<(Token, Span)>) -> Self { + Self::new_with_max_depth(tokens, effective_max_ast_depth()) + } + + /// Construct a parser with a custom AST-depth cap. See `DEFAULT_MAX_AST_DEPTH` + /// for the rationale on the default; the CLI plumbs `--max-ast-depth` here. + pub fn new_with_max_depth(tokens: Vec<(Token, Span)>, max_depth: usize) -> Self { // Filter out newlines — idea9 uses ; as separator. Each surviving // `Token::Newline` came out of `lexer::normalize_newlines`, which // converts indented continuations into `;` and only keeps a literal @@ -83,6 +131,8 @@ impl Parser { Parser { tokens: filtered, pos: 0, + depth: 0, + max_depth: max_depth.max(1), decl_boundary, fn_arity, fn_param_is_fn, @@ -92,6 +142,37 @@ impl Parser { } } + /// Check that incrementing `depth` would stay within `max_depth`. Returns + /// `ILO-P103` otherwise. Call this at the very top of every recursive + /// parse entry point — paired with `depth_inc()` / `depth_dec()` (or the + /// `DepthGuard` RAII helper) so an early-return via `?` still decrements. + fn check_depth(&self) -> Result<()> { + if self.depth >= self.max_depth { + let cap = self.max_depth; + Err(self.error_hint( + "ILO-P103", + format!("AST nesting depth exceeded {cap}"), + format!( + "deeply nested input is almost always a DoS vector against `ilo serv` or a generated payload, not real source. raise the cap with `--max-ast-depth N` if a legitimate program needs more than {cap} levels of nesting." + ), + )) + } else { + Ok(()) + } + } + + fn depth_inc(&mut self) { + self.depth += 1; + } + + fn depth_dec(&mut self) { + // Saturating: depth invariants in tests/asserts catch bugs without + // panicking a real CLI run. 
+ if self.depth > 0 { + self.depth -= 1; + } + } + /// Returns `Some(span)` if an unindented newline (top-level declaration /// boundary) sits immediately before the current token. The span points /// at the newline byte itself, but for diagnostic anchoring callers @@ -459,6 +540,14 @@ impl Parser { } fn parse_decl(&mut self) -> Result { + self.check_depth()?; + self.depth_inc(); + let result = self.parse_decl_body(); + self.depth_dec(); + result + } + + fn parse_decl_body(&mut self) -> Result { // Reserved-keyword binding attempts: `var=5`, `let=5`, `if=5`, ... // Surface the friendly ILO-P011 message before any expression-level // cascade fires. Use the binding-context hint (rename to a non-reserved @@ -925,6 +1014,14 @@ impl Parser { // ---- Types ---- fn parse_type(&mut self) -> Result { + self.check_depth()?; + self.depth_inc(); + let result = self.parse_type_body(); + self.depth_dec(); + result + } + + fn parse_type_body(&mut self) -> Result { // Safety net: if we're about to read a type from across a top-level // declaration boundary or from past EOF, the source is malformed (a // nested type slot ran off the end of its line — e.g. @@ -1155,6 +1252,14 @@ impl Parser { } fn parse_stmt(&mut self) -> Result { + self.check_depth()?; + self.depth_inc(); + let result = self.parse_stmt_body(); + self.depth_dec(); + result + } + + fn parse_stmt_body(&mut self) -> Result { // Reserved-keyword binding attempts inside a function body: `var=5`, // `let=5`, `if=5`, ... Surface the friendly ILO-P011 message before // `parse_atom` cascades into a cryptic ILO-P009. 
Use binding-context @@ -1817,6 +1922,14 @@ impl Parser { } fn parse_pattern(&mut self) -> Result { + self.check_depth()?; + self.depth_inc(); + let result = self.parse_pattern_body(); + self.depth_dec(); + result + } + + fn parse_pattern_body(&mut self) -> Result { match self.peek() { Some(Token::Caret) => { self.advance(); @@ -2034,6 +2147,14 @@ impl Parser { // ---- Expressions ---- fn parse_expr(&mut self) -> Result { + self.check_depth()?; + self.depth_inc(); + let result = self.parse_expr_body(); + self.depth_dec(); + result + } + + fn parse_expr_body(&mut self) -> Result { let expr = match self.peek() { Some(Token::Tilde) => { self.advance(); @@ -3550,6 +3671,14 @@ results first: `r={first_op}a b;…r` keeps each step explicit." /// Parse an atom — the smallest expression unit fn parse_atom(&mut self) -> Result { + self.check_depth()?; + self.depth_inc(); + let result = self.parse_atom_body(); + self.depth_dec(); + result + } + + fn parse_atom_body(&mut self) -> Result { match self.peek().cloned() { Some(Token::Number(n)) => { self.advance(); @@ -4595,7 +4724,18 @@ fn is_guard_eligible_condition(expr: &Expr) -> bool { /// the program for execution — error nodes are skipped by the verifier but not /// by the backends. pub fn parse(tokens: Vec<(Token, Span)>) -> (Program, Vec) { - let mut parser = Parser::new(tokens); + parse_with_max_depth(tokens, effective_max_ast_depth()) +} + +/// Same as `parse` but with a custom AST-depth cap. CLI entry points +/// (`ilo run`, `ilo check`, `ilo build`, `ilo serv`) plumb `--max-ast-depth` +/// here so an operator can override the default `DEFAULT_MAX_AST_DEPTH` when a +/// legitimate program needs deeper nesting. 
+pub fn parse_with_max_depth( + tokens: Vec<(Token, Span)>, + max_depth: usize, +) -> (Program, Vec) { + let mut parser = Parser::new_with_max_depth(tokens, max_depth); parser.parse_program() } diff --git a/tests/parser_depth_cap.rs b/tests/parser_depth_cap.rs new file mode 100644 index 00000000..bdf6c00a --- /dev/null +++ b/tests/parser_depth_cap.rs @@ -0,0 +1,202 @@ +//! Regression tests for the AST nesting-depth cap (ILO-P103). +//! +//! Borrowed from Zero (rocicorp/mono#6000): any context that compiles +//! untrusted source — `ilo serv`, the bare-positional dispatch — is exposed +//! to deeply nested expressions that can blow the parser stack. These tests +//! pin the cap behaviour: +//! +//! 1. A 1000-deep nested expression is rejected with `ILO-P103` at the +//! default cap (256). +//! 2. The cap is overridable via `parser::parse_with_max_depth`, mirroring +//! the `--max-ast-depth` CLI flag exposed on `ilo` and `ilo serv`. +//! 3. The same input fed through the same parse pipeline `ilo serv` uses is +//! rejected with an `ILO-P103` parse-phase diagnostic, not a stack +//! overflow. + +use ilo::ast::Span; +use ilo::lexer; +use ilo::parser::{self, DEFAULT_MAX_AST_DEPTH}; + +fn lex_to_pairs(src: &str) -> Vec<(lexer::Token, Span)> { + let tokens = lexer::lex(src).expect("lex failed"); + tokens + .into_iter() + .map(|(t, r)| { + ( + t, + Span { + start: r.start, + end: r.end, + }, + ) + }) + .collect() +} + +/// Build a `n`-deep nested expression: `main>n;(((...((1))...)))`. +/// Each paren bumps parser depth by 2 (`parse_expr` → `parse_atom`), so at +/// `n = DEFAULT_MAX_AST_DEPTH / 2` and above the cap fires. 
+fn deeply_nested_source(n: usize) -> String {
+    let mut s = String::with_capacity(7 + n * 2 + 1);
+    s.push_str("main>n;");
+    for _ in 0..n {
+        s.push('(');
+    }
+    s.push('1');
+    for _ in 0..n {
+        s.push(')');
+    }
+    s
+}
+
+/// Debug parser frames are ~24 KB each; even with the depth cap in place the
+/// parser still recurses up to the cap before erroring out, which blows past
+/// the 2 MB default test thread stack. Every test here runs on a 32 MB stack
+/// so the cap fires logically rather than via SIGSEGV.
+fn run_on_fat_stack(f: impl FnOnce() + Send + 'static) {
+    std::thread::Builder::new()
+        .stack_size(32 * 1024 * 1024)
+        .spawn(f)
+        .expect("spawn test thread")
+        .join()
+        .expect("thread panicked");
+}
+
+#[test]
+fn deep_nest_at_default_cap_triggers_p103() {
+    run_on_fat_stack(|| {
+        let src = deeply_nested_source(1000);
+        let pairs = lex_to_pairs(&src);
+        let (_prog, errs) = parser::parse(pairs);
+        assert!(
+            errs.iter().any(|e| e.code == "ILO-P103"),
+            "expected ILO-P103 at default cap, got {:?}",
+            errs.iter().map(|e| e.code).collect::<Vec<_>>()
+        );
+        let p103 = errs
+            .iter()
+            .find(|e| e.code == "ILO-P103")
+            .expect("ILO-P103 present");
+        assert!(
+            p103.message.contains(&DEFAULT_MAX_AST_DEPTH.to_string()),
+            "P103 message should name the cap; got {:?}",
+            p103.message
+        );
+        assert!(
+            p103.hint
+                .as_deref()
+                .unwrap_or("")
+                .contains("--max-ast-depth"),
+            "P103 hint should point at the override flag; got {:?}",
+            p103.hint
+        );
+    });
+}
+
+#[test]
+fn deep_nest_under_cap_parses_clean() {
+    run_on_fat_stack(|| {
+        // 100 parens => depth 200 < 256 default.
+        let src = deeply_nested_source(100);
+        let pairs = lex_to_pairs(&src);
+        let (_prog, errs) = parser::parse(pairs);
+        assert!(
+            errs.is_empty(),
+            "expected clean parse under cap, got {errs:?}"
+        );
+    });
+}
+
+#[test]
+fn explicit_override_raises_cap() {
+    run_on_fat_stack(|| {
+        // 140 parens => depth ~280 > 256 default; under a raised 1024 cap
+        // the same program parses clean.
+        let src = deeply_nested_source(140);
+        let pairs = lex_to_pairs(&src);
+        let (_prog, errs_default) = parser::parse(pairs.clone());
+        assert!(
+            errs_default.iter().any(|e| e.code == "ILO-P103"),
+            "140-deep should trip the default cap; got {:?}",
+            errs_default.iter().map(|e| e.code).collect::<Vec<_>>()
+        );
+        let (_prog, errs_raised) = parser::parse_with_max_depth(pairs, 1024);
+        assert!(
+            errs_raised.is_empty(),
+            "expected clean parse under 1024 cap, got {errs_raised:?}"
+        );
+    });
+}
+
+#[test]
+fn explicit_override_can_lower_cap() {
+    run_on_fat_stack(|| {
+        // 30-deep nest (depth ~60) is rejected under a tight cap.
+        let src = deeply_nested_source(30);
+        let pairs = lex_to_pairs(&src);
+        let (_prog, errs) = parser::parse_with_max_depth(pairs, 32);
+        assert!(
+            errs.iter().any(|e| e.code == "ILO-P103"),
+            "expected ILO-P103 under tight cap, got {:?}",
+            errs.iter().map(|e| e.code).collect::<Vec<_>>()
+        );
+    });
+}
+
+/// `ilo serv` exposes a JSON-over-stdio surface that compiles arbitrary
+/// program text from clients. The depth cap must reject a deep-nest payload
+/// before the parser blows the stack. We don't drive the full `serv_cmd`
+/// stdio loop here (it owns stdin), but we exercise the same parse pipeline
+/// the serv request handler uses and assert the failure mode.
+#[test]
+fn serv_style_parse_rejects_deep_nest() {
+    run_on_fat_stack(|| {
+        let src = deeply_nested_source(1000);
+        let tokens = lexer::lex(&src).expect("lex");
+        let token_spans: Vec<_> = tokens
+            .into_iter()
+            .map(|(t, r)| {
+                (
+                    t,
+                    Span {
+                        start: r.start,
+                        end: r.end,
+                    },
+                )
+            })
+            .collect();
+        let (_prog, errs) = parser::parse(token_spans);
+        assert!(
+            errs.iter().any(|e| e.code == "ILO-P103"),
+            "serv parse path must reject deep nest with ILO-P103, got {:?}",
+            errs.iter().map(|e| e.code).collect::<Vec<_>>()
+        );
+    });
+}
+
+/// Defensive: a deeply nested *statement* chain (foreach/if etc.) doesn't
+/// share the paren path, but it still pumps `parse_stmt` recursively. Confirm
+/// the depth cap covers that surface too.
+#[test]
+fn deep_nest_statement_chain_triggers_p103() {
+    run_on_fat_stack(|| {
+        let mut src = String::from("main>n;");
+        // wh true{wh true{wh true{ ... ; 1 ... }}} — each `wh true{` adds a
+        // nested statement level (parse_stmt -> body -> parse_stmt).
+        let n = 300;
+        for _ in 0..n {
+            src.push_str("wh true{");
+        }
+        src.push('1');
+        for _ in 0..n {
+            src.push('}');
+        }
+        let pairs = lex_to_pairs(&src);
+        let (_prog, errs) = parser::parse(pairs);
+        assert!(
+            errs.iter().any(|e| e.code == "ILO-P103"),
+            "deep statement chain should trip ILO-P103, got {:?}",
+            errs.iter().map(|e| e.code).collect::<Vec<_>>()
+        );
+    });
+}
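
Review note: the six parse entry points in `src/parser/mod.rs` each repeat the same `check_depth()?` / `depth_inc()` / body / `depth_dec()` sequence, and the field docs mention a `DepthGuard` helper that is not visible in this diff. One way to keep that sequence in a single place is a closure-taking wrapper. The following is a minimal standalone sketch of the depth-cap technique, not the ilo parser: `MiniParser`, `nested`, `ParseError`, `parse_nested`, and `nested_source` are hypothetical names introduced only for illustration, and the toy grammar accepts nothing but a single digit wrapped in parentheses.

```rust
/// Hypothetical error type for this sketch (not ilo's diagnostic type).
#[derive(Debug, PartialEq)]
enum ParseError {
    DepthExceeded { cap: usize },
    Unexpected,
}

/// Toy recursive-descent parser over bytes, carrying a depth counter and cap.
struct MiniParser<'a> {
    src: &'a [u8],
    pos: usize,
    depth: usize,
    max_depth: usize,
}

impl<'a> MiniParser<'a> {
    fn new(src: &'a str, max_depth: usize) -> Self {
        MiniParser {
            src: src.as_bytes(),
            pos: 0,
            depth: 0,
            // Same clamp as the diff: a cap of 0 is treated as 1.
            max_depth: max_depth.max(1),
        }
    }

    /// Run `body` one nesting level deeper. The cap is checked before the
    /// increment, and the decrement runs after `body` returns, so an early
    /// `?` inside `body` still leaves `depth` balanced.
    fn nested<T>(
        &mut self,
        body: impl FnOnce(&mut Self) -> Result<T, ParseError>,
    ) -> Result<T, ParseError> {
        if self.depth >= self.max_depth {
            return Err(ParseError::DepthExceeded { cap: self.max_depth });
        }
        self.depth += 1;
        let result = body(self);
        self.depth -= 1;
        result
    }

    /// atom := digit | '(' atom ')'
    fn parse_atom(&mut self) -> Result<u32, ParseError> {
        self.nested(|p| match p.src.get(p.pos).copied() {
            Some(b'(') => {
                p.pos += 1;
                let value = p.parse_atom()?;
                if p.src.get(p.pos) == Some(&b')') {
                    p.pos += 1;
                    Ok(value)
                } else {
                    Err(ParseError::Unexpected)
                }
            }
            Some(c) if c.is_ascii_digit() => {
                p.pos += 1;
                Ok(u32::from(c - b'0'))
            }
            _ => Err(ParseError::Unexpected),
        })
    }
}

/// Parse `src` with the given cap and return the innermost digit.
fn parse_nested(src: &str, max_depth: usize) -> Result<u32, ParseError> {
    MiniParser::new(src, max_depth).parse_atom()
}

/// Build `n` opening parens, a `1`, and `n` closing parens.
fn nested_source(n: usize) -> String {
    format!("{}1{}", "(".repeat(n), ")".repeat(n))
}
```

In this sketch each paren costs one depth level rather than the two levels (`parse_expr` then `parse_atom`) described in the test comments above, so the numeric thresholds differ from the real parser; the point is only that the check, increment, and decrement live in one function and cannot drift apart across entry points.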