v1.57.2.0 feat: AskUserQuestion prose fallback when the tool fails at runtime by garrytan · Pull Request #1908 · garrytan/gstack

garrytan · 2026-06-08T02:58:36Z

Summary

When AskUserQuestion fails at runtime, gstack skills no longer stall or hard-block. Each skill detects the failure, classifies whether a human is present, and degrades gracefully.

AskUserQuestion failure fallback (the feature)

bin/gstack-session-kind classifies each session as interactive / headless / spawned, echoed as SESSION_KIND at skill start (alongside BRANCH / REPO_MODE).
A prose-fallback rule in the preamble (generate-ask-user-format.ts): on a genuine AskUserQuestion failure (tool absent, or present-but-erroring like Conductor's flaky MCP), retry once, then branch on SESSION_KIND — spawned auto-chooses, headless BLOCKs, interactive renders the full decision brief as prose you answer by typing a letter. Mandatory triad: issue ELI10 + per-choice completeness + recommendation. Carves out [plan-tune auto-decide] as not-a-failure; qualifies the former "tool_use, not prose" assertions.
A defensive PostToolUse hook (auq-error-fallback-hook.ts) that, when an AskUserQuestion call errors, injects a reminder to run the fallback for the current session kind. Defensive/inert: only fires on a missing-result/error payload, never on a real answer; inert if the platform doesn't invoke PostToolUse on tool errors (unverified — see the spike doc).

Pre-existing test fixes (bundled per fix-wave discipline)

Kebab testNames for the section-loading E2Es so the TOUCHFILES completeness gate matches them (was failing on main).
Deterministic external-host doc-freshness checks (beforeAll regen) — removes standalone flakiness.

Merge reconciliation

Merged main (carve-guard system); regenerated all hosts; per-carve maxSizeRatio override gives document-release 1.08 headroom for the cross-cutting preamble addition while every other skill keeps 1.05.

Test Coverage

All new code paths covered: gstack-session-kind (env permutations incl. spawned/empty), the prose-fallback rule + mandatory triad + [plan-tune auto-decide] carve-out + OV2 collision guard + a cross-file invariant, and the error hook's detection + per-session-kind directive (incl. the false-positive guards from review). Free suite: 0 failures. Parity: 13/13.

Pre-Landing Review

Codex adversarial review found 3 real issues, all fixed and re-tested:

1. PostToolUse fallback hook shared source 'plan-tune-cathedral' with the question-log hook (same event+matcher) — gstack-settings-hook replaces the entry, so it would clobber plan-tune capture. Fixed: own source 'auq-error-fallback'; ALREADY_INSTALLED requires both sources.
2. isErrorResponse triggered on any string containing 'internal error'/'is_error' — a real answer or {"is_error": false} payload could fire the fallback. Fixed: narrowed to the missing-result sentinel + boolean is_error===true; regression tests added.
3. SDK runner mutated process.env.GSTACK_HEADLESS process-wide (leaked headless into later tests). Fixed: removed; GSTACK_HEADLESS=1 now set in the eval package.json scripts.

Design Review

No frontend files changed — design review skipped.

Eval Results

Gate-tier paid evals ran (diff selected all due to a global touchfile). The AskUserQuestion-bearing skill evals passed (plan, plan-tune, review, office-hours, session-intelligence, deploy, llm-judge). Two non-blocking failures, both proven unrelated: Gemini E2E (periodic-tier, external CLI not authed here, $0) and three ios-qa agent-flow sims (pass 4/4 in isolation; fail only under --concurrent from session-daemon port contention; reference none of this branch's code).

Scope Drift

Scope Check: CLEAN — 21 source files, exactly the AskUserQuestion fallback work plus the two approved pre-existing test fixes and merge reconciliation. No creep.

Plan Completion

PASS — all plan items (Changes 1-7: helper + preamble echo, prose-fallback rule, plan-mode note, harness hardening, regen, tests, runtime hook) are DONE, plus all eng-review decisions (A1-T1) and outside-voice decisions (OV1-OV3) resolved and folded.

TODOS

No TODO items completed in this PR.

Test plan

Free suite green (bun run test, 0 failures) — re-run on final HEAD
Parity 13/13; gate evals: AskUserQuestion-bearing skills passed

🤖 Generated with Claude Code

Classifies the session as spawned | headless | interactive from env markers (OPENCLAW_SESSION / GSTACK_HEADLESS / CONDUCTOR_* / CLAUDE_CODE_ENTRYPOINT / CI), defaulting to interactive. Echoed once at skill start alongside BRANCH/REPO_MODE so the AskUserQuestion-failure fallback can branch without a shell-out at failure time. Degrade-safe: empty/error => interactive. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…sions) On a genuine AUQ failure (tool absent, or present-but-erroring like Conductor's flaky MCP returning '[Tool result missing due to internal error]'): retry once, then branch on SESSION_KIND — spawned auto-chooses, headless BLOCKs, interactive renders a prose decision brief the user answers by typing a letter. The prose fallback MUST surface the triad: a clear ELI10 of the issue, a per-choice Completeness score, and a recommendation+why (one paragraph per choice). Carves out the [plan-tune auto-decide] denial as NOT a failure, and qualifies the former 'tool_use, not prose' assertions so the rule isn't self-contradicting. Tests pin the triad, the SESSION_KIND branch, the OV2 collision guard, the always-loaded guarantee, and a cross-file invariant on the auto-decide prefix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Headless harness runs classify as headless (BLOCK on AUQ failure rather than emit a prose question no one reads). SDK runner uses ambient mutation, not the Options.env object, to avoid breaking the SDK auth pipeline. Interactive-path suites opt out by overriding the env per-run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

When an AskUserQuestion call returns an error/missing result, this hook injects additionalContext reminding the model to run the prose fallback for the current SESSION_KIND. It does not render prose itself — it guarantees the reminder fires at the moment of failure instead of relying on the model recalling SESSION_KIND. Inert on success and inert if the platform never invokes PostToolUse on tool errors (unverified — could not force the Conductor MCP error in a harness; see the spike doc). The prompt-level fallback covers the case regardless. Decision logic is unit-tested deterministically; registered in setup beside the existing AUQ hooks. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Regenerated from the resolver changes (gen:skill-docs --host all). Refreshes the byte-exact ship golden fixtures (claude/codex/factory). Spec prose tightened so the cross-cutting preamble addition stays under the 5% per-skill parity ceiling (investigate 4.8%) — guard unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ES keys The two section-loading E2E tests used display-form testNames ('/ship section-loading', '/plan-ceo-review section-loading') while every other E2E testName and their E2E_TOUCHFILES keys are kebab. The completeness gate does an exact `name in E2E_TOUCHFILES` check, so it failed (pre-existing on main); diff- based selection also couldn't match them. Align to ship-section-loading / plan-ceo-section-loading. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The parameterized host smoke + --host all freshness tests assumed an external `gen:skill-docs --host all` had run first (it never does in `bun test`), so which host reported STALE varied by sibling-test timing — flaky. Regenerate the gitignored external host dirs in a beforeAll so the --dry-run check is deterministic. It still catches non-deterministic generation (the real bug class for regenerated outputs); the tracked-claude freshness test runs earlier and is unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…fallback

…ent-release Merging main brought the carve of document-release (smaller skeleton); the AUQ prose-fallback adds ~2KB to every skill's always-loaded preamble, landing document-release at ~5.9% over the pre-carve v1.53.0.0 baseline. Add a per-carve maxSizeRatio override (CARVE_GUARDS single source of truth) and bump only this skill to 1.08. All other skills keep the strict 1.05 ceiling. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Codex pre-landing review found three real issues: - The PostToolUse fallback hook shared source 'plan-tune-cathedral' with the question-log hook (same event+matcher); gstack-settings-hook replaces the entry, so it would have clobbered plan-tune capture. Give it its own 'auq-error-fallback' source (separate entry, both run); ALREADY_INSTALLED now requires both sources. - isErrorResponse triggered on any string containing 'internal error'/'is_error', so a real answer or a {"is_error": false} payload could fire the fallback after a successful question. Narrow it to the missing-result sentinel + boolean is_error. - The SDK runner mutated process.env.GSTACK_HEADLESS process-wide (leaked headless into later tests). Removed; GSTACK_HEADLESS=1 now lives in the eval package.json scripts, scoped to the invocation and inherited by the SDK child. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

trunk-io · 2026-06-08T02:58:39Z

Merging to main in this repository is managed by Trunk.

To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

github-actions · 2026-06-08T03:22:36Z

E2E Evals: ❌ FAIL

63/66 tests passed | $10.84 total cost | 12 parallel runners

Suite	Result	Status	Cost
e2e-browse	7/7	✅	$0.42
e2e-deploy	6/6	✅	$1.36
e2e-design	4/4	✅	$0.73
e2e-plan	8/8	✅	$2.72
e2e-qa-workflow	3/3	✅	$1.29
e2e-review	7/7	✅	$2.89
e2e-workflow	4/4	✅	$0.89
llm-judge	24/27	❌	$0.54

12x ubicloud-standard-8 (Docker: pre-baked toolchain + deps) | wall clock ≈ slowest suite

Failures

❌ document-release/SKILL.md workflow: unknown
❌ document-release/SKILL.md workflow: unknown
❌ document-release/SKILL.md workflow: unknown

garrytan and others added 11 commits June 7, 2026 17:51

Merge remote-tracking branch 'origin/main' into garrytan/auq-failure-…

83ad66e

…fallback

chore: bump version and changelog (v1.57.2.0)

03053d4

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

garrytan merged commit 4dfdb7c into main Jun 8, 2026
23 of 24 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v1.57.2.0 feat: AskUserQuestion prose fallback when the tool fails at runtime#1908

v1.57.2.0 feat: AskUserQuestion prose fallback when the tool fails at runtime#1908
garrytan merged 11 commits into
mainfrom
garrytan/auq-failure-fallback

garrytan commented Jun 8, 2026

Uh oh!

trunk-io Bot commented Jun 8, 2026

Uh oh!

github-actions Bot commented Jun 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

garrytan commented Jun 8, 2026

Summary

Test Coverage

Pre-Landing Review

Design Review

Eval Results

Scope Drift

Plan Completion

TODOS

Test plan

Uh oh!

trunk-io Bot commented Jun 8, 2026

Uh oh!

github-actions Bot commented Jun 8, 2026

E2E Evals: ❌ FAIL

Failures

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant