
feat(examples): add evaluation optimization closed-loop example #119

Open
YAO-001 wants to merge 9 commits into trpc-group:main from YAO-001:codex/issue-91-eval-optimize-loop


YAO-001 commented Jul 4, 2026


English

This PR implements issue #91 with a reproducible Evaluation + Optimization closed-loop example.

Highlights:

  • Adds examples/optimization/eval_optimize_loop with train/validation evalsets, baseline prompt, optimizer config, README, DESIGN, and example reports.
  • Provides a deterministic fake mode that runs without API keys and covers baseline train/validation evaluation, failure attribution, candidate optimization, validation regression, per-case delta, configurable gate, and audit artifacts.
  • The gate rejects overfit candidates that improve train results but regress on validation or protected cases.
  • Adds an SDK mode that integrates the real AgentOptimizer and TargetPrompt. SDK mode maps OptimizeResult aggregate metrics, cost, duration, token usage, best prompts, and round summaries into the same report.
  • Keeps SDK wrapper gate config separate from SDK OptimizeConfigFile via --gate-config.
  • Supports multiple TargetPrompt fields without requiring system_prompt.
  • Writes strict JSON and append-only audit artifacts with input hashes, prompt hashes, per-field prompt snapshots, diffs, case results, gate reasons, and the reproducibility command.
  • Source prompt write-back is disabled by default; pass --update-source explicitly to enable it.
  • Adds tests for fake hidden-sample generalization, gate rejection paths, failure attribution, report schema, SDK adapter wiring, SDK aggregate gate, audit safety, run-id validation, strict JSON, and non-finite numeric rejection.
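
For reviewers, the gate behavior described in the highlights above can be pictured with a minimal sketch. This is not the code in this PR: `GateConfig`, `evaluate_gate`, and the field names are hypothetical, and the real gate config schema may differ. It only assumes that per-case validation scores are available for both the baseline and the candidate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateConfig:
    # Hypothetical fields for illustration; not the example's real schema.
    min_validation_delta: float = 0.0
    protected_cases: frozenset = frozenset()


def evaluate_gate(baseline, candidate, config):
    """Compare per-case validation scores (dicts of case_id -> score)."""
    if baseline.keys() != candidate.keys():
        raise ValueError("baseline and candidate must cover the same cases")
    if not baseline:
        raise ValueError("at least one validation case is required")

    # Per-case delta: positive means the candidate improved on that case.
    deltas = {cid: candidate[cid] - baseline[cid] for cid in baseline}
    mean_delta = sum(deltas.values()) / len(deltas)

    reasons = []
    if mean_delta < config.min_validation_delta:
        reasons.append(f"validation mean delta {mean_delta:+.3f} below threshold")
    # Any regression on a protected case rejects the candidate outright.
    for cid in sorted(config.protected_cases):
        if deltas.get(cid, 0.0) < 0:
            reasons.append(f"protected case {cid!r} regressed")
    return {"accepted": not reasons, "reasons": reasons, "per_case_delta": deltas}
```

In this sketch, train scores never enter the decision, so a candidate that only improves on train cannot pass by itself; the recorded reasons correspond to the gate reasons written into the audit artifacts.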

Validation:

  • python -m compileall examples/optimization/eval_optimize_loop
  • python -m pytest examples/optimization/eval_optimize_loop/tests
  • python examples/optimization/eval_optimize_loop/run_pipeline.py --mode fake --trace --output-dir /tmp/eval-optimize-loop-fake
  • python examples/optimization/eval_optimize_loop/run_pipeline.py --fake-model --fake-judge --trace --output-dir /tmp/eval-optimize-loop-legacy
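
The strict JSON, hashing, and append-only audit behavior listed under Highlights can be sketched with the standard library alone. The helper names below (`sha256_text`, `dump_strict_json`, `append_audit_record`) are hypothetical and are not taken from this PR; the sketch only shows one way such guarantees could be obtained.

```python
import hashlib
import json


def sha256_text(text: str) -> str:
    """Hash a prompt or input file body so a report can reference it."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def dump_strict_json(payload) -> str:
    # allow_nan=False makes json.dumps raise ValueError for NaN and
    # +/-Infinity instead of emitting non-standard JSON tokens.
    return json.dumps(payload, allow_nan=False, sort_keys=True, ensure_ascii=False)


def append_audit_record(path: str, record: dict) -> None:
    # Serialize first so a rejected record never touches the file,
    # then open in append mode so earlier records are never rewritten.
    line = dump_strict_json(record)
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")
```

Serializing before opening the file keeps a record with a non-finite score from leaving a partial line in the audit log.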



github-actions Bot commented Jul 4, 2026


CLA Assistant Lite bot: All contributors have signed the CLA ✍️ ✅


codecov Bot commented Jul 4, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@73655ab). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             main        #119   +/-   ##
==========================================
  Coverage        ?   87.51506%           
==========================================
  Files           ?         467           
  Lines           ?       44005           
  Branches        ?           0           
==========================================
  Hits            ?       38511           
  Misses          ?        5494           
  Partials        ?           0           



YAO-001 commented Jul 4, 2026


I have read the CLA Document and I hereby sign the CLA

Rook1ex added a commit to trpc-group/cla-database that referenced this pull request Jul 4, 2026