docs: professional Mintlify documentation site #46
Add a complete /docs-site Mintlify documentation site published at doc.eval-harness.padosoft.com, modeled on the AskMyDocs deep-doc template.
- docs.json (mint theme), favicon, and docs-site README
- 31 MDX pages across 7 groups: Get Started, Guides, Metrics & Theory, Best Practices, Configuration & Operations, Architecture, Reference
- Academic depth: KaTeX formulas for ROUGE-L, cosine, BERTScore, and the IR family (recall@k, MRR, nDCG@k), plus Mermaid flow/sequence diagrams
- ADR narrative, evaluation pipeline, report contract, CLI/API/config refs
- Content verified against the package source (commands, metrics, config keys, report-API endpoints)
- README: add Official Documentation section linking the live docs
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 036aeb7032
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| The package is **auto-discovered** — no `config/app.php` edits required. It adds
| **zero migrations** to your schema by default; datasets are YAML and reports are
| JSON/Markdown on a configured disk.
Correct the default migration behavior
This install guide says the package adds zero migrations by default and later presents the online-monitoring migration as publish-only, but the service provider unconditionally calls loadMigrationsFrom(__DIR__.'/../database/migrations'). In a host app that runs php artisan migrate after installing the package, Laravel will include the bundled create_eval_harness_online_scores_table migration even when online monitoring is disabled, so the docs can lead users to unexpected schema changes.
| php artisan eval-harness:run rag.factuality.fy2026 \
| --registrar="App\\Console\\EvalRegistrar" \
| --json --out=eval-report.json
Write CI artifacts to a real workspace path
In this workflow, --out=eval-report.json is a relative report path, so WritesEvalReports sends it to the configured reports disk/prefix by default rather than creating ./eval-report.json in the GitHub Actions workspace. The later artifact upload and the jq eval-report.json threshold example will therefore fail to find the report unless the command adds --raw-path or uploads/reads the configured storage path.
| ```json
| {
| "schema_version": "eval-harness.api.v1",
Use the actual Report API schema version
The runtime Report API envelope uses ReportApiSchema::VERSION, which is eval-harness.report-api.v1, and the route tests assert responses against that constant. A UI client built from this example would pin to eval-harness.api.v1 and reject every live response as unsupported, so the documented envelope version needs to match the emitted value.
| 'middleware' => env('EVAL_HARNESS_API_MIDDLEWARE', 'web,auth'),
| 'trend' => [
| 'max_files_scanned' => env('EVAL_HARNESS_API_TREND_MAX_FILES', 5000),
Align the API config snippet with shipped defaults
This block does not match the published config: the default middleware is [], not web,auth, and the trend scan env var is EVAL_HARNESS_API_TREND_MAX_FILES_SCANNED. If an operator enables only EVAL_HARNESS_API_ENABLED=true based on this snippet, the routes are mounted without auth middleware, and setting EVAL_HARNESS_API_TREND_MAX_FILES has no effect.
| 'disk' => env('EVAL_HARNESS_ADVERSARIAL_DISK'),
| 'path_prefix' => env('EVAL_HARNESS_ADVERSARIAL_PATH'),
Document the manifest discovery env vars accurately
The package reads EVAL_HARNESS_ADVERSARIAL_MANIFEST_DISK and EVAL_HARNESS_ADVERSARIAL_MANIFEST_PATH for HTTP manifest discovery. With the env names shown here, an operator can configure the disk/path exactly as documented but ManifestRepository::discoveryEnabled() still sees no disk and the /adversarial/manifests endpoints keep returning discovery_not_configured.
- Migrations: correct the "zero migrations" claim — the service provider auto-loads the eval_harness_online_scores migration, so `migrate` creates that one table (feature off by default; publishing only customizes it)
- CI gate: add --raw-path so --out lands in the workspace where upload-artifact and jq read it (relative --out goes to the reports disk)
- Report API: envelope schema_version is eval-harness.report-api.v1 (ReportApiSchema::VERSION), not eval-harness.api.v1
- Config api block: middleware default is [] (empty, no auth), and the trend env var is EVAL_HARNESS_API_TREND_MAX_FILES_SCANNED
- Config adversarial block: env vars are EVAL_HARNESS_ADVERSARIAL_MANIFEST_DISK and EVAL_HARNESS_ADVERSARIAL_MANIFEST_PATH (default eval-harness/adversarial/manifests)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codex review — addressed all five P2 findings in 9200237: migration auto-load wording, CI
💡 Codex Review
Reviewed commit: 9200237a00
| "rows": [
| { "sample_id": "capital-france", "tags": ["geography", "easy"],
| "metric": "exact-match", "score": 1.0, "error": null, "details": {} }
| ]
Document the actual JSON report row contract
The public report-contract example documents a top-level rows array, but JsonReportRenderer::render() emits samples with nested scores plus a separate failures array; the same snippet also uses histograms while the renderer emits metric_distributions. A UI or CI script built from this page will read missing fields from real --json reports, so the contract example should match the renderer's actual keys before publishing it as stable documentation.
| Sends the sample `input`, `expected_output`, and the SUT's `actual_output` to the
| configured judge model with a prompt template, and reads back a JSON verdict
| `{"pass": true|false}`. The score is `1.0` on pass, `0.0` on fail. Configure the
| endpoint, model, timeout, and an optional custom `prompt_template` under
Document llm-as-judge's score response shape
This states that llm-as-judge reads {"pass": true|false} and returns a binary score, but LlmAsJudgeMetric::decodeStrictJson() requires a numeric score key and score() returns that continuous 0..1 value. Custom judge prompts/providers following the documented pass schema will be recorded as SampleFailures, and readers will misunderstand graded judge scores.
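The contract described in this finding can be sketched in Python as a strict decoder that fails closed. This is an illustration only, not the package's `LlmAsJudgeMetric::decodeStrictJson()`: the name `decode_judge_verdict`, the use of `ValueError`, and the rejection of scores outside 0..1 are assumptions.

```python
import json


def decode_judge_verdict(raw: str) -> float:
    """Return the numeric score from a judge reply, failing closed otherwise.

    Hypothetical helper: it mirrors the documented contract (a numeric
    `score` key is required), not the package's actual PHP code.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("judge reply is not valid JSON") from exc

    if not isinstance(payload, dict) or "score" not in payload:
        raise ValueError("judge reply has no 'score' key")

    score = payload["score"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("'score' must be numeric")
    # Assumption: a score outside 0..1 is treated as a contract violation.
    if not 0.0 <= score <= 1.0:
        raise ValueError("'score' must be within 0..1")
    return float(score)
```

Under these assumptions a reply of `{"score": 0.75, "reason": "..."}` decodes to a graded value, while the previously documented `{"pass": true}` shape raises instead of being coerced to 1.0.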
| where $\mathrm{norm}$ trims and case-folds. Use it when the expected output is a
| single canonical fact — an id, a date, a country, a yes/no — where any deviation
Document exact-match as strict equality
The docs say exact-match trims and case-folds, but ExactMatchMetric::score() compares expectedOutput === actualOutput; whitespace, case, and punctuation are all significant. Users relying on this page will build datasets that they expect to pass after normalization but that actually score 0.0.
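A small Python sketch makes the difference concrete. Both functions are hypothetical stand-ins: `exact_match` models the strict comparison the review attributes to `ExactMatchMetric::score()`, and `normalized_match` models the trim-and-case-fold behavior the page described.

```python
def exact_match(expected: str, actual: str) -> float:
    # Strict comparison: case, whitespace, and punctuation are all significant.
    return 1.0 if expected == actual else 0.0


def normalized_match(expected: str, actual: str) -> float:
    # The behavior the page described (trim + case-fold). Per the review,
    # this is NOT what the package does; it is shown only for contrast.
    return 1.0 if expected.strip().casefold() == actual.strip().casefold() else 0.0
```

With an expected value of `"Paris"` and an actual output of `"paris "`, the normalized variant would pass while the strict variant scores 0.0, which is the dataset-authoring trap the finding warns about.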
| **Marker mode (baseline).** The `expected_output` declares the citation markers
| that should appear (e.g. `[1]`, `[source:refunds]`); the score is `1.0` when all
| declared markers are present in the actual output, else `0.0`.
Put citation markers in metadata, not expected_output
Marker mode is documented as reading citation markers from expected_output, but CitationGroundednessMetric::citationsFor() ignores expected_output and requires metadata.citations to contain the marker string(s). A dataset authored from this guidance will fail every cited sample with a metric contract error instead of producing groundedness scores.
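As a rough Python sketch of marker mode as this thread describes it (markers come from `metadata.citations`, given as a string or a list, and the score is the fraction of markers present): the function name, plain substring matching, and the `ValueError` raised for a missing or empty `citations` value are assumptions, not the behavior of `CitationGroundednessMetric`.

```python
def citation_groundedness(metadata: dict, actual_output: str) -> float:
    """Fraction of declared citation markers found in the actual output."""
    citations = metadata.get("citations")
    if isinstance(citations, str):
        markers = [citations]
    elif (
        isinstance(citations, list)
        and citations
        and all(isinstance(marker, str) for marker in citations)
    ):
        markers = citations
    else:
        # Assumption: a missing or malformed value is a contract error,
        # not a score of 0.0.
        raise ValueError("metadata.citations must be a marker or a list of markers")

    # Assumption: a marker counts as present on a plain substring match.
    present = sum(1 for marker in markers if marker in actual_output)
    return present / len(markers)
```

Note that `expected_output` plays no part here, which is why a dataset that declares markers only in `expected_output` cannot be scored.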
| Discounted Cumulative Gain at $k$, with gain $g_i$ for the document at rank $i$:
|
| $$
| \text{DCG@}k = \sum_{i=1}^{k} \frac{2^{g_i} - 1}{\log_2(i + 1)}
| $$
Match the implemented nDCG gain formula
This formula documents exponential gains (2^g - 1), but RetrievalNdcgAtKMetric::scoreRanked() computes DCG/IDCG with linear gains ($gain / log(rank + 1, 2)). For graded relevance maps, the documented nDCG values and thresholds will not match the package's actual scores.
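To see how far the two conventions diverge, here is a minimal Python sketch that computes nDCG@k with either linear or exponential gains. It is not `RetrievalNdcgAtKMetric`: the function names are hypothetical, and the ideal ranking is assumed to be the same gains sorted in descending order, which may differ from how the package builds IDCG from a relevance map.

```python
import math


def dcg(gains, k, exponential=False):
    """Discounted cumulative gain over the first k ranks (rank starts at 1)."""
    total = 0.0
    for rank, gain in enumerate(gains[:k], start=1):
        numerator = (2 ** gain - 1) if exponential else gain
        total += numerator / math.log2(rank + 1)
    return total


def ndcg(gains, k, exponential=False):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = sorted(gains, reverse=True)  # assumption: ideal = same gains, sorted
    idcg = dcg(ideal, k, exponential)
    return dcg(gains, k, exponential) / idcg if idcg > 0 else 0.0
```

For graded gains `[1, 3, 2]` at k = 3, the linear variant gives roughly 0.82 and the exponential variant roughly 0.74, so a threshold tuned against one formula does not transfer to the other.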
Second round of accuracy fixes verified against the package source:
- exact-match: document strict === (case/whitespace/punctuation significant),
not normalized/trimmed equality
- llm-as-judge: returns a continuous score 0..1 from {"score","reason"}, not a
binary {"pass":...}; a numeric score key is required (missing key fails closed)
- refusal-quality: graded 0..1 rubric score with a refusal classification, not a
binary match against refusal_expected
- citation-groundedness: marker mode reads metadata.citations (string|list) and
scores the fraction of markers present, not 1.0/0.0 from expected_output
- nDCG@k: linear gains g/log2(rank+1) (matches RetrievalNdcgAtKMetric), not the
exponential 2^g-1 variant; correct the worked-example value to ~0.65
- report contract: match JsonReportRenderer keys — samples/scores + failures
(not rows), metric_distributions (not histograms), the actual usage shape
(observations/totals/reported/latency_ms), and cohort/adversarial shapes
- metrics-catalog: align exact-match / judge / citation rows
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codex review — addressed the second round in 6892ff4: exact-match strict
💡 Codex Review
Reviewed commit: 6892ff4256
| "metrics": { "exact-match": { "mean": 0.95, "p50": 1.0, "p95": 1.0, "pass_rate": 0.95 } } },
| { "name": null, "label": "(untagged)", "is_untagged": true, "sample_count": 3, "metrics": { } }
| ],
| "adversarial": null,
Fix the normal-run adversarial report shape
The public report contract documents adversarial as null for normal runs, but JsonReportRenderer always serializes adversarial from EvalReport::adversarialSummary(); the existing renderer test for an empty report asserts adversarial.total_samples === 0. UI clients generated from this contract will type/branch on null and then mis-handle real non-adversarial reports, which contain an object with zero samples instead.
| use Padosoft\EvalHarness\Contracts\EmbeddingClient;
|
| $this->app->bind(EmbeddingClient::class, fn () => new class implements EmbeddingClient {
| public function embed(string $text): array
Implement the documented embedding fake with embedMany
This fake client does not satisfy the actual EmbeddingClient contract, which defines embedMany(array $texts): array rather than embed(string $text). Anyone copying this snippet to make CI deterministic will get a PHP fatal for an unimplemented interface method instead of a working fake; the example should implement embedMany and return one vector per input in order.
| | `cosine-embedding` | [semantic](/metrics/semantic-similarity) | cosine of sentence embeddings | string expected | embeddings | model, endpoint |
| | `bertscore-like` | [semantic](/metrics/semantic-similarity) | token-level greedy-match F1 | string expected | embeddings | model, endpoint |
| | `llm-as-judge` | [judge](/metrics/llm-as-judge) | graded `score` 0..1 from a rubric | string expected | chat | model, prompt_template |
| | `refusal-quality` | [judge](/metrics/llm-as-judge) | graded refusal-behavior `score` 0..1 | `metadata.refusal_expected` | chat | model, prompt_template |
Remove prompt_template from refusal-quality knobs
refusal-quality uses its hard-coded prompt in RefusalQualityMetric and never reads eval-harness.metrics.llm_as_judge.prompt_template; only llm-as-judge/calibration honor that config. Listing prompt_template as a knob for refusal-quality will mislead users trying to tune the adversarial/refusal rubric via EVAL_HARNESS_JUDGE_PROMPT_TEMPLATE, because their configured prompt will be ignored for this metric.
- semantic-similarity: implement the embedding fake against the real
EmbeddingClient contract (embedMany(array): array, one vector per input),
not a non-existent embed(string)
- report-contract: the top-level `adversarial` block is always an object
({total_samples:0, categories:[], compliance_frameworks:[]} on a normal run),
never null — JsonReportRenderer always serializes adversarialSummary()
- metrics-catalog: refusal-quality uses a built-in prompt and does not read
prompt_template — drop it from that metric's knobs
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
💡 Codex Review
Reviewed commit: a4242371d8
| EVAL_HARNESS_EMBEDDINGS_ENDPOINT=https://api.openai.com/v1/embeddings
| EVAL_HARNESS_EMBEDDINGS_MODEL=text-embedding-3-small
| EVAL_HARNESS_JUDGE_ENDPOINT=https://api.openai.com/v1/chat/completions
| EVAL_HARNESS_JUDGE_MODEL=gpt-4o-mini
| EVAL_HARNESS_JUDGE_API_KEY=sk-...
Add the embeddings API key to OpenAI setup
In the OpenAI setup, this block configures an embeddings endpoint/model but only sets EVAL_HARNESS_JUDGE_API_KEY. The embedding client reads EVAL_HARNESS_EMBEDDINGS_API_KEY with fallback to OPENAI_API_KEY from config/eval-harness.php, not the judge key, so anyone following this example and then using cosine-embedding or bertscore-like sends unauthenticated embedding requests.
| flowchart LR
| RUN[adversarial run] --> CHK{failure-free<br/>and gate-clean?}
| CHK -->|yes| REC[record summary +<br/>advance baseline]
| CHK -->|no| SKIP[summary kept,<br/>baseline NOT advanced]
Document that failed gated runs are not recorded
For --regression-gate runs, the manifest store does not keep failed/gate-failed summaries: AdversarialRunManifestStore::shouldRecordRegressionGateResult() returns false when the gate failed or the report has metric failures. This diagram tells operators that those summaries are still kept, so they may rely on the manifest for failed-run audit/history that will never be written.
- installation: the OpenAI example only set the judge key, so cosine-embedding / bertscore-like would send unauthenticated embedding requests. Show OPENAI_API_KEY (the shared fallback for both judge and embeddings keys) and the optional per-metric EVAL_HARNESS_EMBEDDINGS_API_KEY
- adversarial-testing: correct the manifest-recording diagram — for --regression-gate runs, a metric-failed or gate-failed run is NOT written to the manifest at all (no summary, no baseline), per AdversarialRunManifestStore::shouldRecordRegressionGateResult(); the manifest is not a complete audit log of failed gated runs
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codex review — round 4 in b053751: OpenAI install example now sets OPENAI_API_KEY (shared fallback) + optional EVAL_HARNESS_EMBEDDINGS_API_KEY, and the adversarial manifest diagram now states that --regression-gate runs with metric/gate failures are not recorded at all. Verified against config + AdversarialRunManifestStore.
💡 Codex Review
Reviewed commit: b0537517ad
| RUN[adversarial --regression-gate run] --> CHK{failure-free<br/>AND gate passed?}
| CHK -->|yes| REC[record summary +<br/>advance baseline]
| CHK -->|no| SKIP[NOT recorded —<br/>no summary, no baseline]
Show clean missing-baseline runs as recorded
For a first clean --regression-gate run with no compatible baseline, AdversarialRegressionGate returns missing-baseline, not pass, but AdversarialRunManifestStore::shouldRecordRegressionGateResult() still records it because the result is not failed and the entry has no metric failures. This diagram routes anything other than “gate passed” to “NOT recorded”, which misdocuments the baseline-seeding path and can make operators think the first clean gated run will not be stored for future comparisons.
Refine the adversarial manifest-recording diagram (round-4 over-corrected). Per AdversarialRunManifestStore::shouldRecordRegressionGateResult(), a gated run is recorded unless it failed the gate, has metric failures, or has a missing regression-metric aggregate. A first clean run with no baseline returns the missing-baseline status but IS recorded — that is how the baseline is seeded. Diagram + warning now state both halves: broken runs never seed a baseline, clean ones (including missing-baseline) always do.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
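The recording rule stated in this commit message can be written as a small predicate. The Python sketch below is a paraphrase of that rule, not the code of `AdversarialRunManifestStore::shouldRecordRegressionGateResult()`; the `GatedRun` type, its field names, and the literal status string `"failed"` are assumptions (`"pass"` and `"missing-baseline"` are the statuses named in the thread).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatedRun:
    """Hypothetical summary of one --regression-gate run."""

    gate_status: str                # "pass", "missing-baseline", or "failed" (assumed)
    metric_failures: int            # count of captured metric failures
    has_regression_aggregate: bool  # regression-metric aggregate is present


def should_record(run: GatedRun) -> bool:
    """Record a gated run unless any of the three disqualifiers applies."""
    if run.gate_status == "failed":
        return False
    if run.metric_failures > 0:
        return False
    if not run.has_regression_aggregate:
        return False
    return True
```

A first clean run therefore records even though its status is `missing-baseline`, which is what seeds the baseline for later comparisons.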
💡 Codex Review
Reviewed commit: a4af2d593d
| {
| $engine->dataset('rag.factuality.fy2026')
| ->loadFromYaml(base_path('eval/golden/factuality.yml'))
| ->withMetrics(['exact-match', 'cosine-embedding'])
Keep the quickstart on offline metrics
For a fresh install following this quickstart, no embeddings provider or API key has been configured before the first eval-harness:run, but cosine-embedding invokes the OpenAI-compatible embeddings client. In that scenario the sample run makes an external embeddings request with no token, records a SampleFailure, and exits non-zero, so the five-minute quickstart fails before users see a report. Use only offline metrics here or add an explicit provider/API-key setup step before this registrar.
Keep the quickstart runnable on a fresh install with no API key: swap the registrar's cosine-embedding (which calls an embeddings provider and would SampleFailure + exit non-zero without a key) for the offline rouge-l metric, relabel the sample report row, and add a Tip pointing users to provider setup when they want semantic/judge metrics.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
💡 Codex Review
Reviewed commit: 54dc7c56f3
| Wire that into the same workflow that runs your PHPUnit suite and you have a
| regression gate.
Add a real quality-threshold gate to the quickstart
In this quickstart, CI only consumes eval-harness:run's process status. I checked EvalCommand::handle: after a report is written it fails only when totalFailures() !== 0, so low exact-match/rouge-l scores still exit 0 as long as metrics execute cleanly. If the SUT returns wrong answers without throwing, this workflow passes instead of gating a regression; please add the JSON macro_f1 assertion here or make clear this is only an execution-health check and link users to the CI quality-threshold step.
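A quality gate of the kind this finding asks for can be a few lines run after the report is written. The Python sketch below is one possible shape, under stated assumptions: it assumes the JSON report exposes `macro_f1` as a top-level numeric key, and the function names, messages, and threshold argument are all hypothetical.

```python
import json


def check_quality(report: dict, threshold: float) -> tuple[bool, str]:
    """Compare the report's macro_f1 against a minimum threshold."""
    score = report.get("macro_f1")  # assumption: top-level key in the JSON report
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return False, "report has no numeric macro_f1"
    if score < threshold:
        return False, f"macro_f1 {score:.3f} is below threshold {threshold:.3f}"
    return True, f"macro_f1 {score:.3f} meets threshold {threshold:.3f}"


def main(argv: list[str]) -> int:
    """Entry point: argv = [script, report_path, threshold]; returns an exit code."""
    with open(argv[1], encoding="utf-8") as handle:
        ok, message = check_quality(json.load(handle), float(argv[2]))
    print(message)
    return 0 if ok else 1
```

A CI step could call this with `raise SystemExit(main(sys.argv))` after `import sys`, so that a run which executes cleanly but scores poorly still fails the job, while the Artisan exit code continues to report execution health only.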
Clarify the quickstart's exit-code semantics: it is an execution-health signal (non-zero only on captured metric failures), NOT a quality gate — a SUT that returns wrong-but-clean answers still exits 0. Add a Warning pointing to the macro_f1 quality-threshold step in the CI gate guide.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
💡 Codex Review
Reviewed commit: b1509842d7
| of declared metrics, and a CI gate that exits non-zero the moment quality
| drops.
Do not imply quality drops fail the command automatically
In a cleanly executed run with low metric scores, EvalCommand::handle() still returns success because it only fails when totalFailures() > 0; this sentence tells readers the CI gate exits non-zero as soon as quality drops, so teams following the intro can rely on the Artisan exit code and miss regressions unless they add the documented macro_f1 threshold check.
Reword the introduction's CI-gate claim: the raw command exit code is an execution-health signal and does not fail on merely low scores. The quality gate comes from asserting a macro_f1 threshold (or a metric erroring); link to the CI gate guide. Mirrors EvalCommand::handle (fails only on totalFailures).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
💡 Codex Review
Reviewed commit: c6689ff3d6
| ```php
| use Padosoft\EvalHarness\Datasets\DatasetSample;
| use Padosoft\EvalHarness\Facades\EvalFacade as Eval;
Avoid aliasing the facade as Eval
When this snippet is copied into a PHP file, the import itself fails to parse because eval is a reserved language construct and cannot be used as a use ... as alias; the same pattern appears in the golden-datasets programmatic example. That makes the advertised facade examples unusable for readers, so they should call EvalFacade::... directly or use the registered global facade as \Eval::... instead.
`Eval` is a reserved word in PHP (T_EVAL), so `use ...EvalFacade as Eval;` and `Eval::...` are parse errors — the facade's own docblock notes this. Use `EvalFacade::` directly across core-concepts, golden-datasets, running-evaluations, and scoring-saved-outputs (code + prose + frontmatter), and add the matching import where the eval-set snippet was missing it.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
💡 Codex Review
Reviewed commit: 7b431c2870
| reference exists, lean on metrics that judge the answer on its own merits
| (e.g. refusal/helpfulness rubrics) rather than against a gold string.
Don't recommend refusal rubrics for online samples
When users have no runtime reference and follow this advice by setting the online metric to the built-in refusal-quality, every sampled JudgeLiveSampleJob will fail: the job constructs DatasetSample without any metadata, while RefusalQualityMetric requires metadata.refusal_expected. That means no OnlineScore rows or drift alerts are produced for sampled traffic, so this note should either limit the advice to custom metrics / custom llm-as-judge prompts or document a supported way to provide refusal metadata.
Online monitoring: stop recommending refusal rubrics for sampled traffic. The JudgeLiveSampleJob builds DatasetSample from id/input/expected with no metadata, but refusal-quality requires metadata.refusal_expected, so using it as the online metric fails every sample (no OnlineScore rows, no drift alerts). Reword to keep the llm-as-judge default (graded from input/output alone) and add a Warning steering refusal scoring to the adversarial lane.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
💡 Codex Review
Reviewed commit: a32f37522b
| reference exists, keep the default `online.metric` of `llm-as-judge` (or a
| custom `llm-as-judge` prompt that grades the answer on its own merits), which
| scores from the input/output alone.
Clarify that the default judge still needs a reference
In the no-reference case this guidance will produce misleading online scores: JudgeLiveSampleJob passes the supplied expected into DatasetSample, and the default llm-as-judge prompt grades ACTUAL against EXPECTED rather than input/output alone. If operators pass an empty or placeholder reference based on this note, online drift monitoring will judge answers against that empty placeholder; only a custom prompt that explicitly ignores {expected} supports reference-less scoring.
| ## The response envelope
|
| Every JSON response carries a version discriminator:
Limit the schema envelope claim to success responses
This overstates the report API contract for error paths: several controllers throw NotFoundHttpException, UnprocessableEntityHttpException, or ServiceUnavailableHttpException for missing artifacts, malformed report JSON, and storage/cache failures, and Laravel renders those as standard error JSON rather than this schema_version envelope. UI clients that branch on schema_version for every response will break on those documented 404/422/503 cases, so this should either say successful JSON responses are enveloped or document the unversioned error shape.
- online-monitoring: correct the round-10 wording — the default llm-as-judge
grades ACTUAL against EXPECTED, so it needs a real reference (an empty/placeholder
makes scores meaningless). Reference-less online scoring requires a custom
llm-as-judge prompt_template that ignores {expected}
- report-api: limit the schema_version envelope claim to successful (2xx)
responses; document that 404/422/503 are standard Laravel error JSON without
the envelope, so clients must branch on HTTP status first
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codex review — round 11 in b22418b: online-monitoring now states the default llm-as-judge grades against EXPECTED (needs a real reference; reference-less requires a custom prompt ignoring {expected}); report-api limits the schema_version envelope to 2xx and documents 404/422/503 as standard Laravel error JSON.
💡 Codex Review
Reviewed commit: b22418b9ec
| **`SampleInvocation`** DTO (`sample_id` + `input`) and the worker reconstructs the
| rest from the dataset registry. This avoids serialization failures and stops the
Document SampleInvocation as id/input only
For lazy-parallel runs, the worker does not reconstruct a DatasetSample from the registry: EvaluateSampleJob::handle() resolves the runner and passes only the queued SampleInvocation (whose public fields are id and input), while the producer scores the returned outputs later against its original dataset. As written, this tells queue users that expected output/metadata will be available in workers, which will fail for SampleRunner implementations that rely on that state.
Useful? React with 👍 / 👎.
| **Error responses are not enveloped.** Missing artifacts/ids (`404`), malformed | ||
| report JSON on a JSON-only endpoint (`422`), and storage/cache failures (`503`) | ||
| are thrown as standard Laravel HTTP exceptions and rendered as Laravel's default | ||
| error JSON (e.g. `{ "message": "..." }`) — **without** a `schema_version`. A | ||
| client should branch on the **HTTP status first**, and only read | ||
| `schema_version` / `data` on a `2xx` response. |
Exclude disabled manifest 404s from the error-envelope rule
When adversarial manifest discovery is disabled (eval-harness.adversarial.manifests.disk is null), ManifestController::discoveryNotConfigured() returns a 404 JSON body containing schema_version, error: discovery_not_configured, and message, not Laravel's default { "message": "..." } error shape. This blanket statement will make API clients ignore the documented disabled-surface signal or treat the existing 404 envelope as a contract violation.
- evaluation-pipeline + decisions: correct the SampleInvocation description.
Fields are id + input (not sample_id); the worker hands ONLY the invocation to
the runner and does NOT reconstruct the full DatasetSample, so
expected_output/metadata are unavailable in workers. Scoring runs in the
producer against the original dataset. Add a warning that runners needing
expected/metadata are serial-only
- report-api: carve out the disabled-manifest 404 from the unenveloped-error
rule — it returns an enveloped {schema_version, error: discovery_not_configured,
message} body, a deliberate "feature off" signal
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
💡 Codex Review
Reviewed commit: aab7456efa
| — a structured "feature off" signal, not a generic error. Treat a `404` on the | ||
| manifest endpoints as "discovery not configured" rather than "missing". |
Check the manifest error body before treating 404 as disabled
When adversarial.manifests.disk is configured but a UI asks for a manifest name that doesn't exist, ManifestController::show() still returns a normal NotFoundHttpException 404 (the repository throws EvalRunException for the missing file), not the enveloped discovery_not_configured body. This sentence tells clients to treat every manifest 404 as feature-off, so they'd hide genuinely missing links; document checking error === 'discovery_not_configured' instead.
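The client-side rule requested here (status first, then the `error` field) can be sketched in TypeScript. This is a minimal sketch, assuming the client already holds the HTTP status and the parsed JSON body; `classifyReportResponse`, `ReportApiOutcome`, and `isRecord` are hypothetical names, and only `schema_version`, `data`, `error`, `message`, and the `discovery_not_configured` value come from the thread.

```typescript
// Hypothetical client-side types; not part of the package.
type ReportApiOutcome =
  | { kind: "ok"; schemaVersion: unknown; data: unknown }
  | { kind: "discovery_not_configured"; message: string }
  | { kind: "not_found"; message: string }
  | { kind: "error"; status: number; message: string };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function classifyReportResponse(status: number, body: unknown): ReportApiOutcome {
  const record = isRecord(body) ? body : {};
  const rawMessage = record["message"];
  const message = typeof rawMessage === "string" ? rawMessage : "";

  // 1. Branch on HTTP status first: only 2xx responses carry the envelope.
  if (status >= 200 && status < 300) {
    return { kind: "ok", schemaVersion: record["schema_version"], data: record["data"] };
  }

  // 2. A 404 means "feature off" only when the body says so explicitly.
  if (status === 404 && record["error"] === "discovery_not_configured") {
    return { kind: "discovery_not_configured", message };
  }

  // 3. Any other 404 is a genuinely missing resource (plain Laravel error JSON).
  if (status === 404) {
    return { kind: "not_found", message };
  }

  // 4. Remaining statuses (for example 422 or 503) are unenveloped errors.
  return { kind: "error", status, message };
}
```

Keeping the `discovery_not_configured` check ahead of the generic 404 branch lets a UI hide a disabled surface while still reporting a broken manifest link.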
| | Endpoint | Purpose | | ||
| | --- | --- | | ||
| | `GET /batches/live` | Active lazy-parallel batch ids + status. | |
Don't promise status in the live batch list
For a client building the live-batches view from this catalog, GET /batches/live does not include any per-batch status: BatchLiveController::live() renders BatchLiveResource, which only emits each batch's id and expires_at; status is returned by /batches/{id}/progress. As written, UI clients will dereference a field that is never sent unless they make the progress call, so this row should either remove + status or say that status requires the progress endpoint.
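A live-batches view therefore has to join two payloads. The sketch below assumes the shapes named in this comment (`id` and `expires_at` from the list, `status` from the progress call); the interface names, the string types, and `attachStatus` are hypothetical, and the progress payload's other fields are not modeled.

```typescript
// Assumed shapes: the list endpoint emits only `id` and `expires_at`.
interface LiveBatch {
  id: string;
  expires_at: string;
}

// Only `status` is modeled; the real progress payload may carry more fields.
interface BatchProgress {
  status: string;
}

interface LiveBatchRow extends LiveBatch {
  // null until the progress call for this batch has returned.
  status: string | null;
}

function attachStatus(
  batches: LiveBatch[],
  progressById: Map<string, BatchProgress>,
): LiveBatchRow[] {
  return batches.map((batch) => {
    const progress = progressById.get(batch.id);
    return { ...batch, status: progress ? progress.status : null };
  });
}
```

Modeling `status` as nullable makes the UI handle the window in which the list has loaded but the per-batch progress call has not.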
report-api refinements:
- manifest 404s: distinguish the enveloped discovery_not_configured (feature
off) from a plain NotFoundHttpException (configured but the named manifest
doesn't exist) — clients must check the error field, not treat every 404 as
feature-off
- /batches/live emits id + expires_at only; per-batch status comes from
/batches/{id}/progress. Correct the catalog rows so clients don't dereference a
status field the list endpoint never sends
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
💡 Codex Review
Reviewed commit: 59c4b7a46f
| Every metric implements one method. It receives a sample and the SUT's actual |
Describe the Metric contract accurately
The actual Padosoft\EvalHarness\Metrics\Metric interface requires both name() and score(), so saying every metric implements one method misleads users writing custom metrics. If they follow this prose and only implement score(), their class cannot satisfy the interface and will not be usable through the resolver; please update this sentence and the similar “single-method” wording elsewhere to reflect the two-method contract.
| ```mermaid | ||
| flowchart LR | ||
| B[LazyParallelBatch] -->|every N done| PR["BatchProgressReporter::report()"] |
Use the real checkpoint callback name
The BatchProgressReporter interface exposes reportCheckpoint(), not report(). A reader implementing a reporter from this diagram would end up with the wrong callback name and fail to implement the package contract, so the label should point to BatchProgressReporter::reportCheckpoint().
- Metric is a two-method interface (name() + score()), not one method; fix the
"one method"/"single-method" wording in metrics/overview, core-concepts, and
metrics-catalog so custom-metric authors implement both
- BatchProgressReporter's callback is reportCheckpoint(), not report(); fix the
batch-execution diagram label
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codex Review: Didn't find any major issues. You're on a roll.
Summary
Adds a complete, professional Mintlify documentation site under `/docs-site`, intended to publish at doc.eval-harness.padosoft.com. Modeled on the AskMyDocs deep-doc template (motivation → theory → design with a diagram → contract → ADR rationale → worked example → gotchas), written from scratch for this package.

What's included
- `docs.json` (mint theme, teal palette, 7-group nav), `favicon.svg`, and a `docs-site/README.md` with preview/deploy/authoring instructions.
- `flowchart` and `sequenceDiagram` diagrams throughout.
- README: a `## Official Documentation` section linking the live docs, plus a Table-of-Contents entry.

Accuracy
Content was verified against the package source: exact Artisan command signatures (`eval-harness:run`, `eval-harness:adversarial` + `eval:adversarial` alias, `eval-harness:calibrate-judge`), config keys, metric aliases, the evaluation pipeline (incl. the `SampleInvocation` DTO and lazy-parallel result store), and the read-only report-API endpoints and schema discriminators.
Nav parity confirmed: every `docs.json` entry maps to an existing `.mdx` file.

Notes / not yet verified
- `mint dev` was not run locally (no Node/Mintlify CLI in the working environment), so the final rendering of Mermaid diagrams and KaTeX math should be confirmed with a local preview before/at deploy.
- `docs-site/` and bind the `doc.eval-harness.padosoft.com` custom domain.

Test plan
- `cd docs-site && mint dev` renders with no broken-link errors

🤖 Generated with Claude Code