docs: professional Mintlify documentation site by lopadova · Pull Request #46 · padosoft/eval-harness

lopadova · 2026-06-16T22:35:49Z

Summary

Adds a complete, professional Mintlify documentation site under /docs-site, intended to publish at doc.eval-harness.padosoft.com. Modeled on the AskMyDocs deep-doc template (motivation → theory → design with a diagram → contract → ADR rationale → worked example → gotchas), written from scratch for this package.

What's included

- Foundation: docs.json (mint theme, teal palette, 7-group nav), favicon.svg, and a docs-site/README.md with preview/deploy/authoring instructions.
- 31 MDX pages across 7 groups:
  - Get Started: introduction, quickstart, installation, core-concepts
  - Guides: golden-datasets, running-evaluations, ci-gate, scoring-saved-outputs, judge-calibration, online-monitoring, adversarial-testing
  - Metrics & Theory: overview, lexical-and-structural, semantic-similarity, llm-as-judge, retrieval-ranking, ordinal-and-aggregation
  - Best Practices: regression-gating, trustworthy-judges, safety-and-red-teaming
  - Configuration & Operations: configuration, batch-execution, horizon-and-queues, troubleshooting
  - Architecture: overview, evaluation-pipeline, report-contract, decisions (ADR)
  - Reference: cli, report-api, metrics-catalog
- Academic depth: KaTeX formulas for ROUGE-L (LCS F-measure), cosine similarity, BERTScore-like greedy matching, and the IR family (recall@k, MRR, full nDCG@k / DCG / IDCG derivation); Mermaid flowchart and sequenceDiagram diagrams throughout.
- README: new ## Official Documentation section linking the live docs, plus a Table-of-Contents entry.

Accuracy

Content was verified against the package source — exact Artisan command signatures (eval-harness:run, eval-harness:adversarial + eval:adversarial alias, eval-harness:calibrate-judge), config keys, metric aliases, the evaluation pipeline (incl. the SampleInvocation DTO and lazy-parallel result store), and the read-only report-API endpoints and schema discriminators.

Nav parity confirmed: every docs.json entry maps to an existing .mdx file.

Notes / not yet verified

mint dev was not run locally (no Node/Mintlify CLI in the working environment), so the final rendering of Mermaid diagrams and KaTeX math should be confirmed with a local preview before/at deploy.
Deploy step (outside this PR): connect the Mintlify GitHub App with content directory docs-site/ and bind the doc.eval-harness.padosoft.com custom domain.

Test plan

cd docs-site && mint dev renders with no broken-link errors
Mermaid diagrams and LaTeX formulas render correctly
Custom domain resolves after Mintlify connection

🤖 Generated with Claude Code

Add a complete /docs-site Mintlify documentation site published at doc.eval-harness.padosoft.com, modeled on the AskMyDocs deep-doc template.

- docs.json (mint theme), favicon, and docs-site README
- 31 MDX pages across 7 groups: Get Started, Guides, Metrics & Theory, Best Practices, Configuration & Operations, Architecture, Reference
- Academic depth: KaTeX formulas for ROUGE-L, cosine, BERTScore, and the IR family (recall@k, MRR, nDCG@k), plus Mermaid flow/sequence diagrams
- ADR narrative, evaluation pipeline, report contract, CLI/API/config refs
- Content verified against the package source (commands, metrics, config keys, report-API endpoints)
- README: add Official Documentation section linking the live docs

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 036aeb7032

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-16T22:42:22Z

+
+The package is **auto-discovered** — no `config/app.php` edits required. It adds
+**zero migrations** to your schema by default; datasets are YAML and reports are
+JSON/Markdown on a configured disk.


Correct the default migration behavior

This install guide says the package adds zero migrations by default and later presents the online-monitoring migration as publish-only, but the service provider unconditionally calls loadMigrationsFrom(__DIR__.'/../database/migrations'). In a host app that runs php artisan migrate after installing the package, Laravel will include the bundled create_eval_harness_online_scores_table migration even when online monitoring is disabled, so the docs can lead users to unexpected schema changes.

chatgpt-codex-connector · 2026-06-16T22:42:22Z

+          php artisan eval-harness:run rag.factuality.fy2026 \
+            --registrar="App\\Console\\EvalRegistrar" \
+            --json --out=eval-report.json


Write CI artifacts to a real workspace path

In this workflow, --out=eval-report.json is a relative report path, so WritesEvalReports sends it to the configured reports disk/prefix by default rather than creating ./eval-report.json in the GitHub Actions workspace. The later artifact upload and the jq eval-report.json threshold example will therefore fail to find the report unless the command adds --raw-path or uploads/reads the configured storage path.

chatgpt-codex-connector · 2026-06-16T22:42:22Z

+
+```json
+{
+  "schema_version": "eval-harness.api.v1",


Use the actual Report API schema version

The runtime Report API envelope uses ReportApiSchema::VERSION, which is eval-harness.report-api.v1, and the route tests assert responses against that constant. A UI client built from this example would pin to eval-harness.api.v1 and reject every live response as unsupported, so the documented envelope version needs to match the emitted value.

chatgpt-codex-connector · 2026-06-16T22:42:22Z

+    'middleware' => env('EVAL_HARNESS_API_MIDDLEWARE', 'web,auth'),
+    'trend' => [
+        'max_files_scanned' => env('EVAL_HARNESS_API_TREND_MAX_FILES', 5000),


Align the API config snippet with shipped defaults

This block does not match the published config: the default middleware is [], not web,auth, and the trend scan env var is EVAL_HARNESS_API_TREND_MAX_FILES_SCANNED. If an operator enables only EVAL_HARNESS_API_ENABLED=true based on this snippet, the routes are mounted without auth middleware, and setting EVAL_HARNESS_API_TREND_MAX_FILES has no effect.

chatgpt-codex-connector · 2026-06-16T22:42:22Z

+        'disk' => env('EVAL_HARNESS_ADVERSARIAL_DISK'),
+        'path_prefix' => env('EVAL_HARNESS_ADVERSARIAL_PATH'),


Document the manifest discovery env vars accurately

The package reads EVAL_HARNESS_ADVERSARIAL_MANIFEST_DISK and EVAL_HARNESS_ADVERSARIAL_MANIFEST_PATH for HTTP manifest discovery. With the env names shown here, an operator can configure the disk/path exactly as documented but ManifestRepository::discoveryEnabled() still sees no disk and the /adversarial/manifests endpoints keep returning discovery_not_configured.

- Migrations: correct the "zero migrations" claim. The service provider auto-loads the eval_harness_online_scores migration, so `migrate` creates that one table (feature off by default; publishing only customizes it)
- CI gate: add --raw-path so --out lands in the workspace where upload-artifact and jq read it (relative --out goes to the reports disk)
- Report API: envelope schema_version is eval-harness.report-api.v1 (ReportApiSchema::VERSION), not eval-harness.api.v1
- Config api block: middleware default is [] (empty, no auth), and the trend env var is EVAL_HARNESS_API_TREND_MAX_FILES_SCANNED
- Config adversarial block: env vars are EVAL_HARNESS_ADVERSARIAL_MANIFEST_DISK and EVAL_HARNESS_ADVERSARIAL_MANIFEST_PATH (default eval-harness/adversarial/manifests)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lopadova · 2026-06-16T22:56:47Z

@codex review — addressed all five P2 findings in 9200237: migration auto-load wording, CI --raw-path, Report API envelope eval-harness.report-api.v1, empty middleware default + EVAL_HARNESS_API_TREND_MAX_FILES_SCANNED, and the EVAL_HARNESS_ADVERSARIAL_MANIFEST_{DISK,PATH} env names. All verified against the package source.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9200237a00

chatgpt-codex-connector · 2026-06-16T23:02:49Z

+  "rows": [
+    { "sample_id": "capital-france", "tags": ["geography", "easy"],
+      "metric": "exact-match", "score": 1.0, "error": null, "details": {} }
+  ]


Document the actual JSON report row contract

The public report-contract example documents a top-level rows array, but JsonReportRenderer::render() emits samples with nested scores plus a separate failures array; the same snippet also uses histograms while the renderer emits metric_distributions. A UI or CI script built from this page will read missing fields from real --json reports, so the contract example should match the renderer's actual keys before publishing it as stable documentation.

chatgpt-codex-connector · 2026-06-16T23:02:49Z

+Sends the sample `input`, `expected_output`, and the SUT's `actual_output` to the
+configured judge model with a prompt template, and reads back a JSON verdict
+`{"pass": true|false}`. The score is `1.0` on pass, `0.0` on fail. Configure the
+endpoint, model, timeout, and an optional custom `prompt_template` under


Document llm-as-judge's score response shape

This states that llm-as-judge reads {"pass": true|false} and returns a binary score, but LlmAsJudgeMetric::decodeStrictJson() requires a numeric score key and score() returns that continuous 0..1 value. Custom judge prompts/providers following the documented pass schema will be recorded as SampleFailures, and readers will misunderstand graded judge scores.
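To make the required response shape concrete, here is a minimal sketch of a strict decoder that accepts only a numeric `score` key. The function name `decodeJudgeScore` is hypothetical and stands in for the behaviour attributed to `LlmAsJudgeMetric::decodeStrictJson()`; the explicit 0..1 range check is an assumption of this sketch, not something the review confirms.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the strict decoding described in the review;
// the package's own implementation may differ in details.
function decodeJudgeScore(string $raw): float
{
    // Throws JsonException on malformed JSON.
    $payload = json_decode($raw, true, 512, JSON_THROW_ON_ERROR);

    if (!is_array($payload) || !array_key_exists('score', $payload)) {
        throw new UnexpectedValueException('judge response has no "score" key');
    }

    $score = $payload['score'];
    if (!is_int($score) && !is_float($score)) {
        throw new UnexpectedValueException('"score" must be numeric');
    }

    // Assumption of this sketch: reject values outside the 0..1 range.
    if ($score < 0 || $score > 1) {
        throw new UnexpectedValueException('"score" must be within 0..1');
    }

    return (float) $score;
}

var_dump(decodeJudgeScore('{"score": 0.8, "reason": "mostly correct"}')); // float(0.8)

try {
    // The shape the page originally documented is rejected.
    decodeJudgeScore('{"pass": true}');
} catch (UnexpectedValueException $e) {
    echo $e->getMessage(), PHP_EOL;
}
```

A custom judge prompt would therefore need to instruct the model to answer with a `score` field rather than a `pass` flag.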

chatgpt-codex-connector · 2026-06-16T23:02:49Z

+where $\mathrm{norm}$ trims and case-folds. Use it when the expected output is a
+single canonical fact — an id, a date, a country, a yes/no — where any deviation


Document exact-match as strict equality

The docs say exact-match trims and case-folds, but ExactMatchMetric::score() compares expectedOutput === actualOutput; whitespace, case, and punctuation are all significant. Users relying on this page will build datasets that they expect to pass after normalization but that actually score 0.0.
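The practical effect of strict comparison is easy to see in a few lines. The helper below is a hypothetical illustration of `expectedOutput === actualOutput`, not the package's `ExactMatchMetric` class.

```php
<?php
declare(strict_types=1);

// Hypothetical helper mirroring the strict comparison described above.
function exactMatchScore(string $expected, string $actual): float
{
    return $expected === $actual ? 1.0 : 0.0;
}

var_dump(exactMatchScore('Paris', 'Paris'));  // float(1)
var_dump(exactMatchScore('Paris', 'paris'));  // float(0): case differs
var_dump(exactMatchScore('Paris', 'Paris ')); // float(0): trailing space
var_dump(exactMatchScore('Paris', 'Paris.')); // float(0): punctuation
```

Datasets that need tolerance for case or whitespace would have to normalize the system's output before scoring, or use a different metric.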

chatgpt-codex-connector · 2026-06-16T23:02:49Z

+**Marker mode (baseline).** The `expected_output` declares the citation markers
+that should appear (e.g. `[1]`, `[source:refunds]`); the score is `1.0` when all
+declared markers are present in the actual output, else `0.0`.


Put citation markers in metadata, not expected_output

Marker mode is documented as reading citation markers from expected_output, but CitationGroundednessMetric::citationsFor() ignores expected_output and requires metadata.citations to contain the marker string(s). A dataset authored from this guidance will fail every cited sample with a metric contract error instead of producing groundedness scores.
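As a rough illustration of marker mode reading from `metadata.citations`, the sketch below scores the fraction of declared markers found in the output, which is how a later commit message in this thread describes the metric. The function name `citationFraction` and the handling of an empty marker list are assumptions of this sketch, not the package's `CitationGroundednessMetric` code. It requires PHP 8.0 or later for union types and `str_contains()`.

```php
<?php
declare(strict_types=1);

// Hypothetical sketch: $citations plays the role of metadata.citations,
// which the thread describes as a string or a list of marker strings.
function citationFraction(string|array $citations, string $actualOutput): float
{
    $markers = is_array($citations) ? array_values($citations) : [$citations];

    // Assumption: an empty marker list is treated as a contract error.
    if ($markers === []) {
        throw new InvalidArgumentException('metadata.citations must list at least one marker');
    }

    $present = 0;
    foreach ($markers as $marker) {
        if (str_contains($actualOutput, (string) $marker)) {
            $present++;
        }
    }

    return $present / count($markers);
}

var_dump(citationFraction(['[1]', '[2]'], 'Refunds take 5 days [1].'));       // float(0.5)
var_dump(citationFraction('[source:refunds]', 'See [source:refunds] for it.')); // float(1)
```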

chatgpt-codex-connector · 2026-06-16T23:02:49Z

+Discounted Cumulative Gain at $k$, with gain $g_i$ for the document at rank $i$:
+
+$$
+\text{DCG@}k = \sum_{i=1}^{k} \frac{2^{g_i} - 1}{\log_2(i + 1)}
+$$


Match the implemented nDCG gain formula

This formula documents exponential gains (2^g - 1), but RetrievalNdcgAtKMetric::scoreRanked() computes DCG/IDCG with linear gains ($gain / log(rank + 1, 2)). For graded relevance maps, the documented nDCG values and thresholds will not match the package's actual scores.
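For comparison, the linear-gain form can be written in the same notation as the formula quoted above. This is a sketch reconstructed from the `$gain / log(rank + 1, 2)` expression cited in the comment, not text taken from the package docs; $g^{*}_i$ (the same gains sorted in descending order) is notation introduced here.

$$
\text{DCG@}k = \sum_{i=1}^{k} \frac{g_i}{\log_2(i + 1)}, \qquad
\text{IDCG@}k = \sum_{i=1}^{k} \frac{g^{*}_i}{\log_2(i + 1)}, \qquad
\text{nDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}
$$

As an illustration with made-up gains $(3, 2, 0, 1)$ and $k = 4$:

$$
\text{DCG@}4 = \frac{3}{\log_2 2} + \frac{2}{\log_2 3} + \frac{0}{\log_2 4} + \frac{1}{\log_2 5} \approx 4.693, \qquad
\text{IDCG@}4 = \frac{3}{\log_2 2} + \frac{2}{\log_2 3} + \frac{1}{\log_2 4} + \frac{0}{\log_2 5} \approx 4.762
$$

so $\text{nDCG@}4 \approx 0.985$. With the exponential gain $2^{g_i} - 1$ the same ranking gives approximately $0.993$, which is why thresholds tuned against one variant do not transfer to the other.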

Second round of accuracy fixes verified against the package source:

- exact-match: document strict === (case/whitespace/punctuation significant), not normalized/trimmed equality
- llm-as-judge: returns a continuous score 0..1 from {"score","reason"}, not a binary {"pass":...}; a numeric score key is required (missing key fails closed)
- refusal-quality: graded 0..1 rubric score with a refusal classification, not a binary match against refusal_expected
- citation-groundedness: marker mode reads metadata.citations (string|list) and scores the fraction of markers present, not 1.0/0.0 from expected_output
- nDCG@k: linear gains g/log2(rank+1) (matches RetrievalNdcgAtKMetric), not the exponential 2^g-1 variant; correct the worked-example value to ~0.65
- report contract: match JsonReportRenderer keys: samples/scores + failures (not rows), metric_distributions (not histograms), the actual usage shape (observations/totals/reported/latency_ms), and cohort/adversarial shapes
- metrics-catalog: align exact-match / judge / citation rows

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lopadova · 2026-06-16T23:25:07Z

@codex review — addressed the second round in 6892ff4: exact-match strict ===, llm-as-judge/refusal-quality continuous score 0..1, citation-groundedness marker mode reads metadata.citations (fractional), nDCG linear gains g/log2(rank+1), and the report-contract JSON now matches JsonReportRenderer (samples/scores/failures, metric_distributions, real usage shape). All verified against src/.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6892ff4256

chatgpt-codex-connector · 2026-06-16T23:32:39Z

+      "metrics": { "exact-match": { "mean": 0.95, "p50": 1.0, "p95": 1.0, "pass_rate": 0.95 } } },
+    { "name": null, "label": "(untagged)", "is_untagged": true, "sample_count": 3, "metrics": { } }
+  ],
+  "adversarial": null,


Fix the normal-run adversarial report shape

The public report contract documents adversarial as null for normal runs, but JsonReportRenderer always serializes adversarial from EvalReport::adversarialSummary(); the existing renderer test for an empty report asserts adversarial.total_samples === 0. UI clients generated from this contract will type/branch on null and then mis-handle real non-adversarial reports, which contain an object with zero samples instead.

chatgpt-codex-connector · 2026-06-16T23:32:39Z

+use Padosoft\EvalHarness\Contracts\EmbeddingClient;
+
+$this->app->bind(EmbeddingClient::class, fn () => new class implements EmbeddingClient {
+    public function embed(string $text): array


Implement the documented embedding fake with embedMany

This fake client does not satisfy the actual EmbeddingClient contract, which defines embedMany(array $texts): array rather than embed(string $text). Anyone copying this snippet to make CI deterministic will get a PHP fatal for an unimplemented interface method instead of a working fake; the example should implement embedMany and return one vector per input in order.
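A deterministic fake along the lines the review asks for might look like the sketch below. To keep it runnable on its own, it declares a local stand-in interface with only the `embedMany(array $texts): array` method named in the comment; in an application you would import `Padosoft\EvalHarness\Contracts\EmbeddingClient` instead. The class name `FakeEmbeddingClient` and the three-dimensional vectors are illustrative assumptions.

```php
<?php
declare(strict_types=1);

// Local stand-in so the sketch runs without the package installed.
// Assumption: the real contract has this single method signature.
interface EmbeddingClient
{
    /**
     * @param  list<string> $texts
     * @return list<list<float>>
     */
    public function embedMany(array $texts): array;
}

// Hypothetical fake: no network calls, same output for the same input.
final class FakeEmbeddingClient implements EmbeddingClient
{
    public function embedMany(array $texts): array
    {
        // One vector per input, returned in input order.
        return array_map(
            static fn (string $text): array => [
                (float) strlen($text),
                (float) substr_count($text, ' '),
                1.0,
            ],
            array_values($texts),
        );
    }
}

$vectors = (new FakeEmbeddingClient())->embedMany(['a b', 'c']);
var_dump(count($vectors)); // int(2)
```

In a Laravel test the fake would be bound in place of the real client, following the `$this->app->bind(EmbeddingClient::class, ...)` pattern from the diff excerpt above.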

chatgpt-codex-connector · 2026-06-16T23:32:39Z

+| `cosine-embedding` | [semantic](/metrics/semantic-similarity) | cosine of sentence embeddings | string expected | embeddings | model, endpoint |
+| `bertscore-like` | [semantic](/metrics/semantic-similarity) | token-level greedy-match F1 | string expected | embeddings | model, endpoint |
+| `llm-as-judge` | [judge](/metrics/llm-as-judge) | graded `score` 0..1 from a rubric | string expected | chat | model, prompt_template |
+| `refusal-quality` | [judge](/metrics/llm-as-judge) | graded refusal-behavior `score` 0..1 | `metadata.refusal_expected` | chat | model, prompt_template |


Remove prompt_template from refusal-quality knobs

refusal-quality uses its hard-coded prompt in RefusalQualityMetric and never reads eval-harness.metrics.llm_as_judge.prompt_template; only llm-as-judge/calibration honor that config. Listing prompt_template as a knob for refusal-quality will mislead users trying to tune the adversarial/refusal rubric via EVAL_HARNESS_JUDGE_PROMPT_TEMPLATE, because their configured prompt will be ignored for this metric.

- semantic-similarity: implement the embedding fake against the real EmbeddingClient contract (embedMany(array): array, one vector per input), not a non-existent embed(string)
- report-contract: the top-level `adversarial` block is always an object ({total_samples:0, categories:[], compliance_frameworks:[]} on a normal run), never null; JsonReportRenderer always serializes adversarialSummary()
- metrics-catalog: refusal-quality uses a built-in prompt and does not read prompt_template, so drop it from that metric's knobs

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lopadova · 2026-06-16T23:43:18Z

@codex review — round 3 in a424237: embedding fake now implements embedMany(array): array, the top-level adversarial report block is documented as always-present (zeroed on normal runs), and prompt_template removed from refusal-quality knobs (built-in prompt). Verified against src/.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a4242371d8

chatgpt-codex-connector · 2026-06-16T23:50:07Z

+    EVAL_HARNESS_EMBEDDINGS_ENDPOINT=https://api.openai.com/v1/embeddings
+    EVAL_HARNESS_EMBEDDINGS_MODEL=text-embedding-3-small
+    EVAL_HARNESS_JUDGE_ENDPOINT=https://api.openai.com/v1/chat/completions
+    EVAL_HARNESS_JUDGE_MODEL=gpt-4o-mini
+    EVAL_HARNESS_JUDGE_API_KEY=sk-...


Add the embeddings API key to OpenAI setup

In the OpenAI setup, this block configures an embeddings endpoint/model but only sets EVAL_HARNESS_JUDGE_API_KEY. The embedding client reads EVAL_HARNESS_EMBEDDINGS_API_KEY with fallback to OPENAI_API_KEY from config/eval-harness.php, not the judge key, so anyone following this example and then using cosine-embedding or bertscore-like sends unauthenticated embedding requests.

chatgpt-codex-connector · 2026-06-16T23:50:07Z

+flowchart LR
+    RUN[adversarial run] --> CHK{failure-free<br/>and gate-clean?}
+    CHK -->|yes| REC[record summary +<br/>advance baseline]
+    CHK -->|no| SKIP[summary kept,<br/>baseline NOT advanced]


Document that failed gated runs are not recorded

For --regression-gate runs, the manifest store does not keep failed/gate-failed summaries: AdversarialRunManifestStore::shouldRecordRegressionGateResult() returns false when the gate failed or the report has metric failures. This diagram tells operators that those summaries are still kept, so they may rely on the manifest for failed-run audit/history that will never be written.

- installation: the OpenAI example only set the judge key, so cosine-embedding / bertscore-like would send unauthenticated embedding requests. Show OPENAI_API_KEY (the shared fallback for both judge and embeddings keys) and the optional per-metric EVAL_HARNESS_EMBEDDINGS_API_KEY
- adversarial-testing: correct the manifest-recording diagram. For --regression-gate runs, a metric-failed or gate-failed run is NOT written to the manifest at all (no summary, no baseline), per AdversarialRunManifestStore::shouldRecordRegressionGateResult(); the manifest is not a complete audit log of failed gated runs

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lopadova · 2026-06-16T23:55:17Z

@codex review — round 4 in b053751: OpenAI install example now sets OPENAI_API_KEY (shared fallback) + optional EVAL_HARNESS_EMBEDDINGS_API_KEY, and the adversarial manifest diagram now states that --regression-gate runs with metric/gate failures are not recorded at all. Verified against config + AdversarialRunManifestStore.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b0537517ad

chatgpt-codex-connector · 2026-06-17T00:00:56Z

+    RUN[adversarial --regression-gate run] --> CHK{failure-free<br/>AND gate passed?}
+    CHK -->|yes| REC[record summary +<br/>advance baseline]
+    CHK -->|no| SKIP[NOT recorded —<br/>no summary, no baseline]


Show clean missing-baseline runs as recorded

For a first clean --regression-gate run with no compatible baseline, AdversarialRegressionGate returns missing-baseline, not pass, but AdversarialRunManifestStore::shouldRecordRegressionGateResult() still records it because the result is not failed and the entry has no metric failures. This diagram routes anything other than “gate passed” to “NOT recorded”, which misdocuments the baseline-seeding path and can make operators think the first clean gated run will not be stored for future comparisons.

Refine the adversarial manifest-recording diagram (round-4 over-corrected). Per AdversarialRunManifestStore::shouldRecordRegressionGateResult(), a gated run is recorded unless it failed the gate, has metric failures, or has a missing regression-metric aggregate. A first clean run with no baseline returns the missing-baseline status but IS recorded; that is how the baseline is seeded. Diagram + warning now state both halves: broken runs never seed a baseline, clean ones (including missing-baseline) always do.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lopadova · 2026-06-17T00:05:33Z

@codex review — round 5 in a4af2d5: refined the manifest-recording diagram to match shouldRecordRegressionGateResult() — a first clean run returns missing-baseline but IS recorded (seeds the baseline); only metric-failed, gate-failed, or missing-aggregate runs are excluded.
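The recording rule summarized above can be restated as a small predicate. This is a hypothetical sketch, not the body of `AdversarialRunManifestStore::shouldRecordRegressionGateResult()`: the function name, its parameters, and the `'failed'` status literal are assumptions, while `missing-baseline` and `pass` are the statuses quoted in the review.

```php
<?php
declare(strict_types=1);

// Hypothetical restatement of the rule: a gated run is recorded unless it
// failed the gate, has metric failures, or lacks the regression-metric aggregate.
function shouldRecordGatedRun(
    string $gateStatus,
    int $metricFailures,
    bool $hasRegressionAggregate
): bool {
    if ($gateStatus === 'failed') {
        return false; // gate-failed runs never seed a baseline
    }
    if ($metricFailures > 0) {
        return false; // runs with metric failures are excluded
    }
    if (!$hasRegressionAggregate) {
        return false; // nothing to compare future runs against
    }

    return true; // includes 'pass' and the first clean 'missing-baseline' run
}

var_dump(shouldRecordGatedRun('missing-baseline', 0, true)); // bool(true)
var_dump(shouldRecordGatedRun('pass', 0, true));             // bool(true)
var_dump(shouldRecordGatedRun('failed', 0, true));           // bool(false)
var_dump(shouldRecordGatedRun('pass', 2, true));             // bool(false)
```

The first call is the baseline-seeding path: the gate has nothing to compare against, yet the clean run is still stored.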

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a4af2d593d

chatgpt-codex-connector · 2026-06-17T00:12:08Z

+        {
+            $engine->dataset('rag.factuality.fy2026')
+                ->loadFromYaml(base_path('eval/golden/factuality.yml'))
+                ->withMetrics(['exact-match', 'cosine-embedding'])


Keep the quickstart on offline metrics

For a fresh install following this quickstart, no embeddings provider or API key has been configured before the first eval-harness:run, but cosine-embedding invokes the OpenAI-compatible embeddings client. In that scenario the sample run makes an external embeddings request with no token, records a SampleFailure, and exits non-zero, so the five-minute quickstart fails before users see a report. Use only offline metrics here or add an explicit provider/API-key setup step before this registrar.

Keep the quickstart runnable on a fresh install with no API key: swap the registrar's cosine-embedding (which calls an embeddings provider and would SampleFailure + exit non-zero without a key) for the offline rouge-l metric, relabel the sample report row, and add a Tip pointing users to provider setup when they want semantic/judge metrics.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lopadova · 2026-06-17T00:18:21Z

@codex review — round 6 in 54dc7c5: the quickstart now uses offline metrics only (exact-match, rouge-l) so a fresh install runs with no API key, with a Tip pointing to provider setup for semantic/judge metrics.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 54dc7c56f3

chatgpt-codex-connector · 2026-06-17T00:23:19Z

+    Wire that into the same workflow that runs your PHPUnit suite and you have a
+    regression gate.


Add a real quality-threshold gate to the quickstart

In this quickstart, CI only consumes eval-harness:run's process status. I checked EvalCommand::handle: after a report is written it fails only when totalFailures() !== 0, so low exact-match/rouge-l scores still exit 0 as long as metrics execute cleanly. If the SUT returns wrong answers without throwing, this workflow passes instead of gating a regression; please add the JSON macro_f1 assertion here or make clear this is only an execution-health check and link users to the CI quality-threshold step.
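The difference between an execution-health check and a quality gate can be sketched as a separate threshold assertion on the JSON report. In the sketch below, the top-level `macro_f1` key, the function name `passesQualityGate`, and the 0.85 threshold are all assumptions for illustration; check the report contract for the actual location of the aggregate before relying on it.

```php
<?php
declare(strict_types=1);

// Sketch only: assumes the report exposes a numeric top-level "macro_f1".
function passesQualityGate(string $reportJson, float $threshold): bool
{
    $report = json_decode($reportJson, true, 512, JSON_THROW_ON_ERROR);

    if (!is_array($report) || !isset($report['macro_f1']) || !is_numeric($report['macro_f1'])) {
        // Fail closed when the aggregate is missing or malformed.
        return false;
    }

    return (float) $report['macro_f1'] >= $threshold;
}

// Inline sample payloads keep the sketch runnable; in CI you would read the
// file written by --out (with --raw-path) instead.
$good = '{"macro_f1": 0.91}';
$bad  = '{"macro_f1": 0.42}';

var_dump(passesQualityGate($good, 0.85)); // bool(true)
var_dump(passesQualityGate($bad, 0.85));  // bool(false)
```

A CI step built on this idea would exit non-zero when the function returns `false`, which is what turns a cleanly executed but low-scoring run into a failed build.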

Clarify the quickstart's exit-code semantics: it is an execution-health signal (non-zero only on captured metric failures), NOT a quality gate. A SUT that returns wrong-but-clean answers still exits 0. Add a Warning pointing to the macro_f1 quality-threshold step in the CI gate guide.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lopadova · 2026-06-17T00:33:47Z

@codex review — round 7 in b150984: the quickstart now states the exit code is an execution-health signal (not a quality gate) and links to the macro_f1 quality-threshold step. This matches EvalCommand::handle (fails only on totalFailures).

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b1509842d7

chatgpt-codex-connector · 2026-06-17T00:39:37Z

+of declared metrics, and a CI gate that exits non-zero the moment quality
+drops.


Do not imply quality drops fail the command automatically

In a cleanly executed run with low metric scores, EvalCommand::handle() still returns success because it only fails when totalFailures() > 0; this sentence tells readers the CI gate exits non-zero as soon as quality drops, so teams following the intro can rely on the Artisan exit code and miss regressions unless they add the documented macro_f1 threshold check.

Reword the introduction's CI-gate claim: the raw command exit code is an execution-health signal and does not fail on merely low scores. The quality gate comes from asserting a macro_f1 threshold (or a metric erroring); link to the CI gate guide. Mirrors EvalCommand::handle (fails only on totalFailures).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lopadova · 2026-06-17T00:46:16Z

@codex review — round 8 in c6689ff: reworded the introduction so the raw exit code is described as an execution-health signal, with the quality gate coming from a macro_f1 threshold assertion (linked). Swept the rest of the docs for the same phrasing.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c6689ff3d6

chatgpt-codex-connector · 2026-06-17T00:52:33Z

+
+```php
+use Padosoft\EvalHarness\Datasets\DatasetSample;
+use Padosoft\EvalHarness\Facades\EvalFacade as Eval;


Avoid aliasing the facade as Eval

When this snippet is copied into a PHP file, the import itself fails to parse because eval is a reserved language construct and cannot be used as a use ... as alias; the same pattern appears in the golden-datasets programmatic example. That makes the advertised facade examples unusable for readers, so they should call EvalFacade::... directly or use the registered global facade as \Eval::... instead.
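One way to see why the alias cannot work is to ask PHP's own tokenizer how it reads the name. The sketch below uses the built-in `token_get_all()` and `token_name()` functions; the `run()` method is a hypothetical placeholder, since the tokenizer does not care whether it exists.

```php
<?php
declare(strict_types=1);

// Tokenize a line that uses `Eval` as a class name.
// $tokens[0] is the opening tag (including its trailing space);
// $tokens[1] is the identifier that follows it.
$tokens = token_get_all('<?php Eval::run();');
echo token_name($tokens[1][0]), PHP_EOL; // prints "T_EVAL"

// The full facade name is an ordinary identifier.
$tokens = token_get_all('<?php EvalFacade::run();');
echo token_name($tokens[1][0]), PHP_EOL; // prints "T_STRING"
```

Because keywords are matched case-insensitively, `Eval` is lexed as the `eval` language construct rather than as a class name, so the parser rejects it in a `use ... as` alias.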

`Eval` is a reserved word in PHP (T_EVAL), so `use ...EvalFacade as Eval;` and `Eval::...` are parse errors; the facade's own docblock notes this. Use `EvalFacade::` directly across core-concepts, golden-datasets, running-evaluations, and scoring-saved-outputs (code + prose + frontmatter), and add the matching import where the eval-set snippet was missing it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lopadova · 2026-06-17T01:02:15Z

@codex review — round 9 in 7b431c2: replaced the reserved-word Eval facade alias (a PHP parse error, per EvalFacade's own docblock) with EvalFacade:: directly across all four files, plus the missing import in the eval-set snippet.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7b431c2870

chatgpt-codex-connector · 2026-06-17T01:09:09Z

+  reference exists, lean on metrics that judge the answer on its own merits
+  (e.g. refusal/helpfulness rubrics) rather than against a gold string.


Don't recommend refusal rubrics for online samples

When users have no runtime reference and follow this advice by setting the online metric to the built-in refusal-quality, every sampled JudgeLiveSampleJob will fail: the job constructs DatasetSample without any metadata, while RefusalQualityMetric requires metadata.refusal_expected. That means no OnlineScore rows or drift alerts are produced for sampled traffic, so this note should either limit the advice to custom metrics / custom llm-as-judge prompts or document a supported way to provide refusal metadata.

Online monitoring: stop recommending refusal rubrics for sampled traffic.

The JudgeLiveSampleJob builds DatasetSample from id/input/expected with no metadata, but refusal-quality requires metadata.refusal_expected, so using it as the online metric fails every sample (no OnlineScore rows, no drift alerts). Reword to keep the llm-as-judge default (graded from input/output alone) and add a Warning steering refusal scoring to the adversarial lane.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lopadova · 2026-06-17T01:15:19Z

@codex review — round 10 in a32f375: online-monitoring note no longer recommends refusal-quality (the JudgeLiveSampleJob builds DatasetSample with no metadata, so it would fail every sample); keeps the llm-as-judge default and steers refusal scoring to the adversarial lane.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a32f37522b

chatgpt-codex-connector · 2026-06-17T01:19:58Z

+  reference exists, keep the default `online.metric` of `llm-as-judge` (or a
+  custom `llm-as-judge` prompt that grades the answer on its own merits), which
+  scores from the input/output alone.


Clarify that the default judge still needs a reference

In the no-reference case this guidance will produce misleading online scores: JudgeLiveSampleJob passes the supplied expected into DatasetSample, and the default llm-as-judge prompt grades ACTUAL against EXPECTED rather than input/output alone. If operators pass an empty or placeholder reference based on this note, online drift monitoring will judge answers against that empty placeholder; only a custom prompt that explicitly ignores {expected} supports reference-less scoring.

chatgpt-codex-connector · 2026-06-17T01:19:58Z

+## The response envelope
+
+Every JSON response carries a version discriminator:


Limit the schema envelope claim to success responses

This overstates the report API contract for error paths: several controllers throw NotFoundHttpException, UnprocessableEntityHttpException, or ServiceUnavailableHttpException for missing artifacts, malformed report JSON, and storage/cache failures, and Laravel renders those as standard error JSON rather than this schema_version envelope. UI clients that branch on schema_version for every response will break on those documented 404/422/503 cases, so this should either say successful JSON responses are enveloped or document the unversioned error shape.
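
One way a UI client might apply this suggestion is sketched below in TypeScript. It is a minimal illustration, not part of the package: the type and function names (`ReportEnvelope`, `ReportResult`, `readReportResponse`) are hypothetical, the `data` key and the value type of `schema_version` are assumptions, and endpoint-specific error bodies are not handled.

```typescript
// Hypothetical client-side types. Only the key names `schema_version`, `data`,
// and `message` come from the review thread; value types are assumptions.
interface ReportEnvelope<T> {
  schema_version: string | number;
  data: T;
}

type ReportResult<T> =
  | { kind: "ok"; schemaVersion: string | number; data: T }
  | { kind: "error"; status: number; message: string };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// Branch on the HTTP status first; read the envelope only on a 2xx response.
function readReportResponse<T>(status: number, body: unknown): ReportResult<T> {
  if (status >= 200 && status < 300) {
    if (isRecord(body) && "schema_version" in body && "data" in body) {
      const envelope = body as unknown as ReportEnvelope<T>;
      return { kind: "ok", schemaVersion: envelope.schema_version, data: envelope.data };
    }
    return { kind: "error", status, message: "2xx response without an envelope" };
  }
  // Error paths (for example 404/422/503): no envelope is expected here,
  // so fall back to a plain `message` string when one is present.
  const raw = isRecord(body) ? body["message"] : undefined;
  const message = typeof raw === "string" ? raw : "";
  return { kind: "error", status, message };
}
```

The function takes an already-parsed status and body so it stays independent of any particular HTTP library.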

- online-monitoring: correct the round-10 wording. The default llm-as-judge grades ACTUAL against EXPECTED, so it needs a real reference (an empty or placeholder reference makes scores meaningless). Reference-less online scoring requires a custom llm-as-judge prompt_template that ignores {expected}.
- report-api: limit the schema_version envelope claim to successful (2xx) responses; document that 404/422/503 are standard Laravel error JSON without the envelope, so clients must branch on HTTP status first.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lopadova · 2026-06-17T01:26:58Z

@codex review — round 11 in b22418b: online-monitoring now states the default llm-as-judge grades against EXPECTED (needs a real reference; reference-less requires a custom prompt ignoring {expected}); report-api limits the schema_version envelope to 2xx and documents 404/422/503 as standard Laravel error JSON.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b22418b9ec

chatgpt-codex-connector · 2026-06-17T01:33:10Z

+**`SampleInvocation`** DTO (`sample_id` + `input`) and the worker reconstructs the
+rest from the dataset registry. This avoids serialization failures and stops the


Document SampleInvocation as id/input only

For lazy-parallel runs, the worker does not reconstruct a DatasetSample from the registry: EvaluateSampleJob::handle() resolves the runner and passes only the queued SampleInvocation (whose public fields are id and input), while the producer scores the returned outputs later against its original dataset. As written, this tells queue users that expected output/metadata will be available in workers, which will fail for SampleRunner implementations that rely on that state.

chatgpt-codex-connector · 2026-06-17T01:33:10Z

+**Error responses are not enveloped.** Missing artifacts/ids (`404`), malformed
+report JSON on a JSON-only endpoint (`422`), and storage/cache failures (`503`)
+are thrown as standard Laravel HTTP exceptions and rendered as Laravel's default
+error JSON (e.g. `{ "message": "..." }`) — **without** a `schema_version`. A
+client should branch on the **HTTP status first**, and only read
+`schema_version` / `data` on a `2xx` response.


Exclude disabled manifest 404s from the error-envelope rule

When adversarial manifest discovery is disabled (eval-harness.adversarial.manifests.disk is null), ManifestController::discoveryNotConfigured() returns a 404 JSON body containing schema_version, error: discovery_not_configured, and message, not Laravel's default { "message": "..." } error shape. This blanket statement will make API clients ignore the documented disabled-surface signal or treat the existing 404 envelope as a contract violation.

- evaluation-pipeline + decisions: correct the SampleInvocation description. Fields are id + input (not sample_id); the worker hands ONLY the invocation to the runner and does NOT reconstruct the full DatasetSample, so expected_output/metadata are unavailable in workers. Scoring runs in the producer against the original dataset. Add a warning that runners needing expected/metadata are serial-only.
- report-api: carve out the disabled-manifest 404 from the unenveloped-error rule: it returns an enveloped {schema_version, error: discovery_not_configured, message} body, a deliberate "feature off" signal.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lopadova · 2026-06-17T01:39:59Z

@codex review — round 12 in aab7456: SampleInvocation documented as id+input only with no worker-side reconstruction (expected/metadata unavailable in workers; scoring in the producer), and the disabled-manifest 404 carved out as an enveloped discovery_not_configured signal.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aab7456efa

chatgpt-codex-connector · 2026-06-17T01:46:34Z

+  — a structured "feature off" signal, not a generic error. Treat a `404` on the
+  manifest endpoints as "discovery not configured" rather than "missing".


Check the manifest error body before treating 404 as disabled

When adversarial.manifests.disk is configured but a UI asks for a manifest name that doesn't exist, ManifestController::show() still returns a normal NotFoundHttpException 404 (the repository throws EvalRunException for the missing file), not the enveloped discovery_not_configured body. This sentence tells clients to treat every manifest 404 as feature-off, so they'd hide genuinely missing links; document checking error === 'discovery_not_configured' instead.
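
A client-side check along these lines could look like the TypeScript sketch below. The outcome labels and the function name `classifyManifestResponse` are hypothetical; the only details taken from this thread are the `404` status and the `error: discovery_not_configured` field of the enveloped body.

```typescript
// Hypothetical outcome labels for a manifest request; not part of the package.
type ManifestOutcome =
  | "ok"
  | "discovery_not_configured" // feature is switched off
  | "manifest_missing"         // feature is on, but the named manifest does not exist
  | "other_error";

function asRecord(value: unknown): Record<string, unknown> | null {
  return typeof value === "object" && value !== null
    ? (value as Record<string, unknown>)
    : null;
}

// Inspect the body of a 404 instead of treating every 404 as "feature off".
function classifyManifestResponse(status: number, body: unknown): ManifestOutcome {
  if (status >= 200 && status < 300) {
    return "ok";
  }
  if (status === 404) {
    const record = asRecord(body);
    const error = record ? record["error"] : undefined;
    return error === "discovery_not_configured"
      ? "discovery_not_configured"
      : "manifest_missing";
  }
  return "other_error";
}
```

With this split, a UI can hide the manifest surface only for `discovery_not_configured` and still show a broken-link state for `manifest_missing`.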

chatgpt-codex-connector · 2026-06-17T01:46:35Z

+
+| Endpoint | Purpose |
+| --- | --- |
+| `GET /batches/live` | Active lazy-parallel batch ids + status. |


Don't promise status in the live batch list

For a client building the live-batches view from this catalog, GET /batches/live does not include any per-batch status: BatchLiveController::live() renders BatchLiveResource, which only emits each batch's id and expires_at; status is returned by /batches/{id}/progress. As written, UI clients will dereference a field that is never sent unless they make the progress call, so this row should either remove + status or say that status requires the progress endpoint.
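
For a live-batches view, the two endpoints could be combined roughly as in this TypeScript sketch. It assumes the list items carry only `id` and `expires_at` and that the progress payload has a `status` field, as described above; the field value types, the `"running"`-style status strings, and every type and function name here are hypothetical.

```typescript
// Item shape from GET /batches/live as described in the review: id + expires_at only.
// Value types (string) are assumptions.
interface LiveBatch {
  id: string;
  expires_at: string;
}

// Minimal assumed slice of the GET /batches/{id}/progress payload.
interface BatchProgress {
  status: string;
}

interface LiveBatchRow {
  id: string;
  expiresAt: string;
  status: string | null; // null until the progress call for this batch has returned
}

// Join the live list with whatever progress responses have been fetched so far.
function buildLiveRows(
  live: LiveBatch[],
  progressById: Record<string, BatchProgress | undefined>,
): LiveBatchRow[] {
  return live.map((batch) => ({
    id: batch.id,
    expiresAt: batch.expires_at,
    status: progressById[batch.id]?.status ?? null,
  }));
}
```

Keeping `status` nullable makes it explicit in the view model that the list endpoint alone never supplies it.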

report-api refinements:

- manifest 404s: distinguish the enveloped discovery_not_configured (feature off) from a plain NotFoundHttpException (configured but the named manifest doesn't exist). Clients must check the error field, not treat every 404 as feature-off.
- /batches/live emits id + expires_at only; per-batch status comes from /batches/{id}/progress. Correct the catalog rows so clients don't dereference a status field the list endpoint never sends.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lopadova · 2026-06-17T01:51:09Z

@codex review — round 13 in 59c4b7a: manifest 404s now distinguish discovery_not_configured (feature off) from a plain missing-name 404, and /batches/live is documented as id + expires_at only (status comes from /batches/{id}/progress).

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 59c4b7a46f

chatgpt-codex-connector · 2026-06-17T01:57:31Z

+
+Every metric implements one method. It receives a sample and the SUT's actual


Describe the Metric contract accurately

The actual Padosoft\EvalHarness\Metrics\Metric interface requires both name() and score(), so saying every metric implements one method misleads users writing custom metrics. If they follow this prose and only implement score(), their class cannot satisfy the interface and will not be usable through the resolver; please update this sentence and the similar “single-method” wording elsewhere to reflect the two-method contract.

chatgpt-codex-connector · 2026-06-17T01:57:31Z

+
+```mermaid
+flowchart LR
+    B[LazyParallelBatch] -->|every N done| PR["BatchProgressReporter::report()"]


Use the real checkpoint callback name

The BatchProgressReporter interface exposes reportCheckpoint(), not report(). A reader implementing a reporter from this diagram would end up with the wrong callback name and fail to implement the package contract, so the label should point to BatchProgressReporter::reportCheckpoint().

- Metric is a two-method interface (name() + score()), not one method. Fix the "one method"/"single-method" wording in metrics/overview, core-concepts, and metrics-catalog so custom-metric authors implement both.
- BatchProgressReporter's callback is reportCheckpoint(), not report(). Fix the batch-execution diagram label.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lopadova · 2026-06-17T02:02:29Z

@codex review — round 14 in 520d97b: Metric documented as a two-method interface (name() + score()) across overview/core-concepts/catalog, and the batch-execution diagram now labels the callback reportCheckpoint().

chatgpt-codex-connector · 2026-06-17T02:09:00Z

Codex Review: Didn't find any major issues. You're on a roll.

Reviewed commit: 520d97b345

lopadova merged commit 69e486c into main Jun 17, 2026
6 checks passed

lopadova deleted the docs/mintlify-doc-site branch June 17, 2026 06:06
