docs: professional Mintlify documentation site #46

Merged
lopadova merged 15 commits into main from docs/mintlify-doc-site
Jun 17, 2026

Conversation

@lopadova

Contributor

Summary

Adds a complete, professional Mintlify documentation site under /docs-site, intended to publish at doc.eval-harness.padosoft.com. Modeled on the AskMyDocs deep-doc template (motivation → theory → design with a diagram → contract → ADR rationale → worked example → gotchas), written from scratch for this package.

What's included

  • Foundation: docs.json (mint theme, teal palette, 7-group nav), favicon.svg, and a docs-site/README.md with preview/deploy/authoring instructions.
  • 31 MDX pages across 7 groups:
    • Get Started — introduction, quickstart, installation, core-concepts
    • Guides — golden-datasets, running-evaluations, ci-gate, scoring-saved-outputs, judge-calibration, online-monitoring, adversarial-testing
    • Metrics & Theory — overview, lexical-and-structural, semantic-similarity, llm-as-judge, retrieval-ranking, ordinal-and-aggregation
    • Best Practices — regression-gating, trustworthy-judges, safety-and-red-teaming
    • Configuration & Operations — configuration, batch-execution, horizon-and-queues, troubleshooting
    • Architecture — overview, evaluation-pipeline, report-contract, decisions (ADR)
    • Reference — cli, report-api, metrics-catalog
  • Academic depth — KaTeX formulas for ROUGE-L (LCS F-measure), cosine similarity, BERTScore-like greedy matching, and the IR family (recall@k, MRR, full nDCG@k / DCG / IDCG derivation); Mermaid flowchart and sequenceDiagram diagrams throughout.
  • README — new ## Official Documentation section linking the live docs, plus a Table-of-Contents entry.

Accuracy

Content was verified against the package source — exact Artisan command signatures (eval-harness:run, eval-harness:adversarial + eval:adversarial alias, eval-harness:calibrate-judge), config keys, metric aliases, the evaluation pipeline (incl. the SampleInvocation DTO and lazy-parallel result store), and the read-only report-API endpoints and schema discriminators.

Nav parity confirmed: every docs.json entry maps to an existing .mdx file.

Notes / not yet verified

  • mint dev was not run locally (no Node/Mintlify CLI in the working environment), so the final rendering of Mermaid diagrams and KaTeX math should be confirmed with a local preview before/at deploy.
  • Deploy step (outside this PR): connect the Mintlify GitHub App with content directory docs-site/ and bind the doc.eval-harness.padosoft.com custom domain.

Test plan

  • cd docs-site && mint dev renders with no broken-link errors
  • Mermaid diagrams and LaTeX formulas render correctly
  • Custom domain resolves after Mintlify connection

🤖 Generated with Claude Code

Add a complete /docs-site Mintlify documentation site published at
doc.eval-harness.padosoft.com, modeled on the AskMyDocs deep-doc template.

- docs.json (mint theme), favicon, and docs-site README
- 31 MDX pages across 7 groups: Get Started, Guides, Metrics & Theory,
  Best Practices, Configuration & Operations, Architecture, Reference
- Academic depth: KaTeX formulas for ROUGE-L, cosine, BERTScore, and the
  IR family (recall@k, MRR, nDCG@k), plus Mermaid flow/sequence diagrams
- ADR narrative, evaluation pipeline, report contract, CLI/API/config refs
- Content verified against the package source (commands, metrics, config
  keys, report-API endpoints)
- README: add Official Documentation section linking the live docs

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 036aeb7032

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread docs-site/installation.mdx Outdated
Comment on lines +12 to +15

The package is **auto-discovered** — no `config/app.php` edits required. It adds
**zero migrations** to your schema by default; datasets are YAML and reports are
JSON/Markdown on a configured disk.

P2: Correct the default migration behavior

This install guide says the package adds zero migrations by default and later presents the online-monitoring migration as publish-only, but the service provider unconditionally calls loadMigrationsFrom(__DIR__.'/../database/migrations'). In a host app that runs php artisan migrate after installing the package, Laravel will include the bundled create_eval_harness_online_scores_table migration even when online monitoring is disabled, so the docs can lead users to unexpected schema changes.

Comment thread docs-site/guides/ci-gate.mdx Outdated
Comment on lines +63 to +65
php artisan eval-harness:run rag.factuality.fy2026 \
--registrar="App\\Console\\EvalRegistrar" \
--json --out=eval-report.json

P2: Write CI artifacts to a real workspace path

In this workflow, --out=eval-report.json is a relative report path, so WritesEvalReports sends it to the configured reports disk/prefix by default rather than creating ./eval-report.json in the GitHub Actions workspace. The later artifact upload and the jq eval-report.json threshold example will therefore fail to find the report unless the command adds --raw-path or uploads/reads the configured storage path.

Comment thread docs-site/reference/report-api.mdx Outdated

```json
{
  "schema_version": "eval-harness.api.v1",
```

P2: Use the actual Report API schema version

The runtime Report API envelope uses ReportApiSchema::VERSION, which is eval-harness.report-api.v1, and the route tests assert responses against that constant. A UI client built from this example would pin to eval-harness.api.v1 and reject every live response as unsupported, so the documented envelope version needs to match the emitted value.
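
To make the failure mode concrete, here is a minimal client-side version check, sketched in Python rather than the package's PHP. The function name `accept_envelope` is hypothetical; only the two version strings come from this comment.

```python
# Version string taken from the review comment (ReportApiSchema::VERSION).
SUPPORTED_SCHEMA_VERSION = "eval-harness.report-api.v1"


def accept_envelope(envelope: dict) -> dict:
    """Return the envelope if its schema_version is supported, else raise.

    Hypothetical client helper: a client that pinned the value from the
    docs example ("eval-harness.api.v1") would reject every live response.
    """
    version = envelope.get("schema_version")
    if version != SUPPORTED_SCHEMA_VERSION:
        raise ValueError(f"unsupported schema_version: {version!r}")
    return envelope
```

A client built this way accepts an envelope carrying `eval-harness.report-api.v1` and raises on the originally documented `eval-harness.api.v1`.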

Comment thread docs-site/configuration.mdx Outdated
Comment on lines +151 to +153
'middleware' => env('EVAL_HARNESS_API_MIDDLEWARE', 'web,auth'),
'trend' => [
'max_files_scanned' => env('EVAL_HARNESS_API_TREND_MAX_FILES', 5000),

P2: Align the API config snippet with shipped defaults

This block does not match the published config: the default middleware is [], not web,auth, and the trend scan env var is EVAL_HARNESS_API_TREND_MAX_FILES_SCANNED. If an operator enables only EVAL_HARNESS_API_ENABLED=true based on this snippet, the routes are mounted without auth middleware, and setting EVAL_HARNESS_API_TREND_MAX_FILES has no effect.

Comment thread docs-site/configuration.mdx Outdated
Comment on lines +172 to +173
'disk' => env('EVAL_HARNESS_ADVERSARIAL_DISK'),
'path_prefix' => env('EVAL_HARNESS_ADVERSARIAL_PATH'),

P2: Document the manifest discovery env vars accurately

The package reads EVAL_HARNESS_ADVERSARIAL_MANIFEST_DISK and EVAL_HARNESS_ADVERSARIAL_MANIFEST_PATH for HTTP manifest discovery. With the env names shown here, an operator can configure the disk/path exactly as documented but ManifestRepository::discoveryEnabled() still sees no disk and the /adversarial/manifests endpoints keep returning discovery_not_configured.

- Migrations: correct the "zero migrations" claim — the service provider
  auto-loads the eval_harness_online_scores migration, so `migrate` creates
  that one table (feature off by default; publishing only customizes it)
- CI gate: add --raw-path so --out lands in the workspace where
  upload-artifact and jq read it (relative --out goes to the reports disk)
- Report API: envelope schema_version is eval-harness.report-api.v1
  (ReportApiSchema::VERSION), not eval-harness.api.v1
- Config api block: middleware default is [] (empty, no auth), and the trend
  env var is EVAL_HARNESS_API_TREND_MAX_FILES_SCANNED
- Config adversarial block: env vars are EVAL_HARNESS_ADVERSARIAL_MANIFEST_DISK
  and EVAL_HARNESS_ADVERSARIAL_MANIFEST_PATH (default eval-harness/adversarial/manifests)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lopadova

Contributor Author

@codex review — addressed all five P2 findings in 9200237: migration auto-load wording, CI --raw-path, Report API envelope eval-harness.report-api.v1, empty middleware default + EVAL_HARNESS_API_TREND_MAX_FILES_SCANNED, and the EVAL_HARNESS_ADVERSARIAL_MANIFEST_{DISK,PATH} env names. All verified against the package source.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9200237a00

Comment on lines +64 to +67
"rows": [
{ "sample_id": "capital-france", "tags": ["geography", "easy"],
"metric": "exact-match", "score": 1.0, "error": null, "details": {} }
]

P2: Document the actual JSON report row contract

The public report-contract example documents a top-level rows array, but JsonReportRenderer::render() emits samples with nested scores plus a separate failures array; the same snippet also uses histograms while the renderer emits metric_distributions. A UI or CI script built from this page will read missing fields from real --json reports, so the contract example should match the renderer's actual keys before publishing it as stable documentation.

Comment thread docs-site/metrics/llm-as-judge.mdx Outdated
Comment on lines +47 to +50
Sends the sample `input`, `expected_output`, and the SUT's `actual_output` to the
configured judge model with a prompt template, and reads back a JSON verdict
`{"pass": true|false}`. The score is `1.0` on pass, `0.0` on fail. Configure the
endpoint, model, timeout, and an optional custom `prompt_template` under

P2: Document llm-as-judge's score response shape

This states that llm-as-judge reads {"pass": true|false} and returns a binary score, but LlmAsJudgeMetric::decodeStrictJson() requires a numeric score key and score() returns that continuous 0..1 value. Custom judge prompts/providers following the documented pass schema will be recorded as SampleFailures, and readers will misunderstand graded judge scores.
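
A rough Python sketch of the fail-closed decoding described here, not the package's PHP. `decode_judge_score` is a hypothetical name, and the 0..1 range check is an assumption: the comment only says a numeric `score` key is required and that the value is continuous in 0..1.

```python
import json


def decode_judge_score(raw: str) -> float:
    """Parse a judge verdict and return its numeric score, or raise.

    Fail-closed: anything without a numeric "score" key is an error, so a
    verdict shaped like {"pass": true} is rejected rather than coerced.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("judge response is not valid JSON") from exc

    if not isinstance(payload, dict):
        raise ValueError("judge response must be a JSON object")

    score = payload.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("judge response needs a numeric 'score' key")

    # Assumption: out-of-range scores are rejected rather than clamped.
    if not 0.0 <= score <= 1.0:
        raise ValueError("judge score must lie in [0, 1]")

    return float(score)
```

In the package, per the comment, a rejected verdict would surface as a SampleFailure; here it is simply a `ValueError`.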

Comment on lines +19 to +20
where $\mathrm{norm}$ trims and case-folds. Use it when the expected output is a
single canonical fact — an id, a date, a country, a yes/no — where any deviation

P2: Document exact-match as strict equality

The docs say exact-match trims and case-folds, but ExactMatchMetric::score() compares expectedOutput === actualOutput; whitespace, case, and punctuation are all significant. Users relying on this page will build datasets that they expect to pass after normalization but that actually score 0.0.
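
The difference is easy to see side by side. This is an illustrative Python sketch, not the package's PHP; `exact_match_normalized` is a hypothetical variant shown only for contrast with the strict comparison the comment describes.

```python
def exact_match_strict(expected: str, actual: str) -> float:
    """Strict equality: case, whitespace, and punctuation all count."""
    return 1.0 if expected == actual else 0.0


def exact_match_normalized(expected: str, actual: str) -> float:
    """Hypothetical trim + case-fold variant (what the docs wrongly implied)."""
    same = expected.strip().casefold() == actual.strip().casefold()
    return 1.0 if same else 0.0


if __name__ == "__main__":
    # The same pair of strings scores differently under the two definitions.
    print(exact_match_strict("Paris", " paris "))
    print(exact_match_normalized("Paris", " paris "))
```

A dataset authored against the normalized behavior would score 0.0 on such samples under strict comparison.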

Comment on lines +47 to +49
**Marker mode (baseline).** The `expected_output` declares the citation markers
that should appear (e.g. `[1]`, `[source:refunds]`); the score is `1.0` when all
declared markers are present in the actual output, else `0.0`.

P2: Put citation markers in metadata, not expected_output

Marker mode is documented as reading citation markers from expected_output, but CitationGroundednessMetric::citationsFor() ignores expected_output and requires metadata.citations to contain the marker string(s). A dataset authored from this guidance will fail every cited sample with a metric contract error instead of producing groundedness scores.
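
The follow-up commit in this thread describes marker mode as reading `metadata.citations` (a string or a list) and scoring the fraction of markers present. A minimal Python sketch of that idea follows; it is not the package's PHP, `marker_groundedness` is a hypothetical name, plain substring matching is an assumption, and so is raising on an empty marker list.

```python
from typing import List, Union


def marker_groundedness(citations: Union[str, List[str]], actual_output: str) -> float:
    """Fraction of declared citation markers that appear in the output."""
    # Accept a single marker string or a list of markers.
    markers = [citations] if isinstance(citations, str) else list(citations)
    if not markers:
        # Assumption: a sample with no declared markers is a contract error.
        raise ValueError("metadata.citations must declare at least one marker")

    present = sum(1 for marker in markers if marker in actual_output)
    return present / len(markers)
```

For example, with markers `[1]` and `[source:refunds]` and an output that cites only `[1]`, half of the declared markers are present.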

Comment thread docs-site/metrics/retrieval-ranking.mdx Outdated
Comment on lines +95 to +99
Discounted Cumulative Gain at $k$, with gain $g_i$ for the document at rank $i$:

$$
\text{DCG@}k = \sum_{i=1}^{k} \frac{2^{g_i} - 1}{\log_2(i + 1)}
$$

P2: Match the implemented nDCG gain formula

This formula documents exponential gains (2^g - 1), but RetrievalNdcgAtKMetric::scoreRanked() computes DCG/IDCG with linear gains ($gain / log(rank + 1, 2)). For graded relevance maps, the documented nDCG values and thresholds will not match the package's actual scores.
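
To show how far the two variants diverge on graded relevance, here is a small Python sketch (not the package's PHP). The function names and the sample gains are illustrative, and the ideal ordering is taken from the ranked gains themselves, which may differ from how the package derives IDCG.

```python
import math
from typing import Sequence


def dcg(gains: Sequence[float], k: int, exponential: bool = False) -> float:
    """DCG@k with linear gains g, or exponential gains 2^g - 1."""
    total = 0.0
    for rank, gain in enumerate(gains[:k], start=1):
        numerator = (2 ** gain - 1) if exponential else gain
        total += numerator / math.log2(rank + 1)
    return total


def ndcg(gains: Sequence[float], k: int, exponential: bool = False) -> float:
    """nDCG@k = DCG@k / IDCG@k, with IDCG from the gains sorted descending."""
    ideal = dcg(sorted(gains, reverse=True), k, exponential)
    return dcg(gains, k, exponential) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    ranked_gains = [1, 3, 2]  # illustrative graded relevance, in ranked order
    print("linear     :", round(ndcg(ranked_gains, 3), 4))
    print("exponential:", round(ndcg(ranked_gains, 3, exponential=True), 4))
```

For this ranking the linear variant scores higher than the exponential one, because the exponential form penalizes the misplaced highly relevant document more heavily. Thresholds tuned against one variant do not carry over to the other.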

Second round of accuracy fixes verified against the package source:

- exact-match: document strict === (case/whitespace/punctuation significant),
  not normalized/trimmed equality
- llm-as-judge: returns a continuous score 0..1 from {"score","reason"}, not a
  binary {"pass":...}; a numeric score key is required (missing key fails closed)
- refusal-quality: graded 0..1 rubric score with a refusal classification, not a
  binary match against refusal_expected
- citation-groundedness: marker mode reads metadata.citations (string|list) and
  scores the fraction of markers present, not 1.0/0.0 from expected_output
- nDCG@k: linear gains g/log2(rank+1) (matches RetrievalNdcgAtKMetric), not the
  exponential 2^g-1 variant; correct the worked-example value to ~0.65
- report contract: match JsonReportRenderer keys — samples/scores + failures
  (not rows), metric_distributions (not histograms), the actual usage shape
  (observations/totals/reported/latency_ms), and cohort/adversarial shapes
- metrics-catalog: align exact-match / judge / citation rows

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lopadova

Contributor Author

@codex review — addressed the second round in 6892ff4: exact-match strict ===, llm-as-judge/refusal-quality continuous score 0..1, citation-groundedness marker mode reads metadata.citations (fractional), nDCG linear gains g/log2(rank+1), and the report-contract JSON now matches JsonReportRenderer (samples/scores/failures, metric_distributions, real usage shape). All verified against src/.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6892ff4256

"metrics": { "exact-match": { "mean": 0.95, "p50": 1.0, "p95": 1.0, "pass_rate": 0.95 } } },
{ "name": null, "label": "(untagged)", "is_untagged": true, "sample_count": 3, "metrics": { } }
],
"adversarial": null,

P2: Fix the normal-run adversarial report shape

The public report contract documents adversarial as null for normal runs, but JsonReportRenderer always serializes adversarial from EvalReport::adversarialSummary(); the existing renderer test for an empty report asserts adversarial.total_samples === 0. UI clients generated from this contract will type/branch on null and then mis-handle real non-adversarial reports, which contain an object with zero samples instead.
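
A client that follows the corrected contract would branch on the sample count rather than on null. The Python sketch below illustrates that; `is_adversarial_run` is a hypothetical helper, and treating `total_samples > 0` as "this was an adversarial run" is an assumption drawn from the zero-sample case mentioned above.

```python
def is_adversarial_run(report: dict) -> bool:
    """Decide from a decoded report whether it carries adversarial results.

    The "adversarial" key is expected to be an object on every report, so a
    missing key or a null value is treated as a contract violation.
    """
    summary = report.get("adversarial")
    if not isinstance(summary, dict):
        raise ValueError("report.adversarial must be an object")
    # Assumption: a normal (non-adversarial) run reports zero samples.
    return summary.get("total_samples", 0) > 0
```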

use Padosoft\EvalHarness\Contracts\EmbeddingClient;

$this->app->bind(EmbeddingClient::class, fn () => new class implements EmbeddingClient {
public function embed(string $text): array

P2: Implement the documented embedding fake with embedMany

This fake client does not satisfy the actual EmbeddingClient contract, which defines embedMany(array $texts): array rather than embed(string $text). Anyone copying this snippet to make CI deterministic will get a PHP fatal for an unimplemented interface method instead of a working fake; the example should implement embedMany and return one vector per input in order.
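
The property the comment asks for (one deterministic vector per input, in input order) can be sketched in Python as below. This is an analogue of the idea, not the PHP `EmbeddingClient` contract; `embed_many`, the hash-based vectors, and the default of 8 dimensions are all illustrative assumptions.

```python
import hashlib
from typing import List


def embed_many(texts: List[str], dims: int = 8) -> List[List[float]]:
    """Return one deterministic pseudo-embedding per input, in input order.

    Each vector is derived from the SHA-256 digest of the text, so the same
    text always maps to the same vector and no network call is made.
    """
    if not 1 <= dims <= 32:
        raise ValueError("dims must be between 1 and 32 (SHA-256 digest size)")

    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        vectors.append([byte / 255.0 for byte in digest[:dims]])
    return vectors
```

Because the vectors depend only on the text, a CI run using a fake like this produces the same similarity scores every time.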

Comment thread docs-site/reference/metrics-catalog.mdx Outdated
| `cosine-embedding` | [semantic](/metrics/semantic-similarity) | cosine of sentence embeddings | string expected | embeddings | model, endpoint |
| `bertscore-like` | [semantic](/metrics/semantic-similarity) | token-level greedy-match F1 | string expected | embeddings | model, endpoint |
| `llm-as-judge` | [judge](/metrics/llm-as-judge) | graded `score` 0..1 from a rubric | string expected | chat | model, prompt_template |
| `refusal-quality` | [judge](/metrics/llm-as-judge) | graded refusal-behavior `score` 0..1 | `metadata.refusal_expected` | chat | model, prompt_template |

P3: Remove prompt_template from refusal-quality knobs

refusal-quality uses its hard-coded prompt in RefusalQualityMetric and never reads eval-harness.metrics.llm_as_judge.prompt_template; only llm-as-judge/calibration honor that config. Listing prompt_template as a knob for refusal-quality will mislead users trying to tune the adversarial/refusal rubric via EVAL_HARNESS_JUDGE_PROMPT_TEMPLATE, because their configured prompt will be ignored for this metric.

- semantic-similarity: implement the embedding fake against the real
  EmbeddingClient contract (embedMany(array): array, one vector per input),
  not a non-existent embed(string)
- report-contract: the top-level `adversarial` block is always an object
  ({total_samples:0, categories:[], compliance_frameworks:[]} on a normal run),
  never null — JsonReportRenderer always serializes adversarialSummary()
- metrics-catalog: refusal-quality uses a built-in prompt and does not read
  prompt_template — drop it from that metric's knobs

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lopadova

Contributor Author

@codex review — round 3 in a424237: embedding fake now implements embedMany(array): array, the top-level adversarial report block is documented as always-present (zeroed on normal runs), and prompt_template removed from refusal-quality knobs (built-in prompt). Verified against src/.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a4242371d8

Comment thread docs-site/installation.mdx Outdated
Comment on lines +75 to +79
EVAL_HARNESS_EMBEDDINGS_ENDPOINT=https://api.openai.com/v1/embeddings
EVAL_HARNESS_EMBEDDINGS_MODEL=text-embedding-3-small
EVAL_HARNESS_JUDGE_ENDPOINT=https://api.openai.com/v1/chat/completions
EVAL_HARNESS_JUDGE_MODEL=gpt-4o-mini
EVAL_HARNESS_JUDGE_API_KEY=sk-...

P2: Add the embeddings API key to OpenAI setup

In the OpenAI setup, this block configures an embeddings endpoint/model but only sets EVAL_HARNESS_JUDGE_API_KEY. The embedding client reads EVAL_HARNESS_EMBEDDINGS_API_KEY with fallback to OPENAI_API_KEY from config/eval-harness.php, not the judge key, so anyone following this example and then using cosine-embedding or bertscore-like sends unauthenticated embedding requests.
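
The lookup order described here can be sketched in Python as follows. This is not the package's config code; `resolve_embeddings_api_key` is a hypothetical name, and treating an empty value as unset is an assumption.

```python
from typing import Mapping, Optional


def resolve_embeddings_api_key(env: Mapping[str, str]) -> Optional[str]:
    """Pick the embeddings key: the dedicated variable, then the shared fallback.

    The judge key is deliberately not consulted, which is why setting only
    EVAL_HARNESS_JUDGE_API_KEY leaves embedding requests unauthenticated.
    """
    return (
        env.get("EVAL_HARNESS_EMBEDDINGS_API_KEY")
        or env.get("OPENAI_API_KEY")
        or None
    )
```

With only `EVAL_HARNESS_JUDGE_API_KEY` set, the function returns `None`, which mirrors the unauthenticated-request problem the comment reports.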

flowchart LR
RUN[adversarial run] --> CHK{failure-free<br/>and gate-clean?}
CHK -->|yes| REC[record summary +<br/>advance baseline]
CHK -->|no| SKIP[summary kept,<br/>baseline NOT advanced]

P2: Document that failed gated runs are not recorded

For --regression-gate runs, the manifest store does not keep failed/gate-failed summaries: AdversarialRunManifestStore::shouldRecordRegressionGateResult() returns false when the gate failed or the report has metric failures. This diagram tells operators that those summaries are still kept, so they may rely on the manifest for failed-run audit/history that will never be written.

- installation: the OpenAI example only set the judge key, so cosine-embedding /
  bertscore-like would send unauthenticated embedding requests. Show
  OPENAI_API_KEY (the shared fallback for both judge and embeddings keys) and the
  optional per-metric EVAL_HARNESS_EMBEDDINGS_API_KEY
- adversarial-testing: correct the manifest-recording diagram — for
  --regression-gate runs, a metric-failed or gate-failed run is NOT written to
  the manifest at all (no summary, no baseline), per
  AdversarialRunManifestStore::shouldRecordRegressionGateResult(); the manifest
  is not a complete audit log of failed gated runs

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lopadova

Contributor Author

@codex review — round 4 in b053751: OpenAI install example now sets OPENAI_API_KEY (shared fallback) + optional EVAL_HARNESS_EMBEDDINGS_API_KEY, and the adversarial manifest diagram now states that --regression-gate runs with metric/gate failures are not recorded at all. Verified against config + AdversarialRunManifestStore.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b0537517ad

Comment on lines +99 to +101
RUN[adversarial --regression-gate run] --> CHK{failure-free<br/>AND gate passed?}
CHK -->|yes| REC[record summary +<br/>advance baseline]
CHK -->|no| SKIP[NOT recorded —<br/>no summary, no baseline]

P2: Show clean missing-baseline runs as recorded

For a first clean --regression-gate run with no compatible baseline, AdversarialRegressionGate returns missing-baseline, not pass, but AdversarialRunManifestStore::shouldRecordRegressionGateResult() still records it because the result is not failed and the entry has no metric failures. This diagram routes anything other than “gate passed” to “NOT recorded”, which misdocuments the baseline-seeding path and can make operators think the first clean gated run will not be stored for future comparisons.

Refine the adversarial manifest-recording diagram (round-4 over-corrected). Per
AdversarialRunManifestStore::shouldRecordRegressionGateResult(), a gated run is
recorded unless it failed the gate, has metric failures, or has a missing
regression-metric aggregate. A first clean run with no baseline returns the
missing-baseline status but IS recorded — that is how the baseline is seeded.
Diagram + warning now state both halves: broken runs never seed a baseline,
clean ones (including missing-baseline) always do.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lopadova

Contributor Author

@codex review — round 5 in a4af2d5: refined the manifest-recording diagram to match shouldRecordRegressionGateResult() — a first clean run returns missing-baseline but IS recorded (seeds the baseline); only metric-failed, gate-failed, or missing-aggregate runs are excluded.
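
The recording rule as stated in this round can be written as a small predicate. The Python below is a sketch of that rule, not the package's `shouldRecordRegressionGateResult()`; the `GatedRun` type and its field names are hypothetical, while the status labels `pass`, `failed`, and `missing-baseline` follow the wording used in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatedRun:
    """Hypothetical summary of one --regression-gate run."""
    gate_status: str                # "pass", "failed", or "missing-baseline"
    metric_failures: int            # count of captured metric failures
    has_regression_aggregate: bool  # regression-metric aggregate was computed


def should_record(run: GatedRun) -> bool:
    """Record a gated run unless it is broken in one of three ways."""
    if run.gate_status == "failed":
        return False
    if run.metric_failures > 0:
        return False
    if not run.has_regression_aggregate:
        return False
    # Clean runs are recorded, including a first run with no baseline:
    # "missing-baseline" is not a failure, and recording it seeds the baseline.
    return True
```

Written this way, the asymmetry is explicit: broken runs never seed a baseline, and clean runs always do.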

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a4af2d593d

Comment thread docs-site/quickstart.mdx Outdated
{
$engine->dataset('rag.factuality.fy2026')
->loadFromYaml(base_path('eval/golden/factuality.yml'))
->withMetrics(['exact-match', 'cosine-embedding'])

P2: Keep the quickstart on offline metrics

For a fresh install following this quickstart, no embeddings provider or API key has been configured before the first eval-harness:run, but cosine-embedding invokes the OpenAI-compatible embeddings client. In that scenario the sample run makes an external embeddings request with no token, records a SampleFailure, and exits non-zero, so the five-minute quickstart fails before users see a report. Use only offline metrics here or add an explicit provider/API-key setup step before this registrar.

Keep the quickstart runnable on a fresh install with no API key: swap the
registrar's cosine-embedding (which calls an embeddings provider and would
SampleFailure + exit non-zero without a key) for the offline rouge-l metric,
relabel the sample report row, and add a Tip pointing users to provider setup
when they want semantic/judge metrics.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lopadova

Contributor Author

@codex review — round 6 in 54dc7c5: the quickstart now uses offline metrics only (exact-match, rouge-l) so a fresh install runs with no API key, with a Tip pointing to provider setup for semantic/judge metrics.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 54dc7c56f3

Comment thread docs-site/quickstart.mdx Outdated
Comment on lines +112 to +113
Wire that into the same workflow that runs your PHPUnit suite and you have a
regression gate.

P2: Add a real quality-threshold gate to the quickstart

In this quickstart, CI only consumes eval-harness:run's process status. I checked EvalCommand::handle: after a report is written it fails only when totalFailures() !== 0, so low exact-match/rouge-l scores still exit 0 as long as metrics execute cleanly. If the SUT returns wrong answers without throwing, this workflow passes instead of gating a regression; please add the JSON macro_f1 assertion here or make clear this is only an execution-health check and link users to the CI quality-threshold step.
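
A separate quality gate of the kind requested here could look like the Python sketch below. It is an illustration, not the step documented in the CI gate guide: `quality_gate` is a hypothetical helper, the top-level position of `macro_f1` in the JSON report is an assumption, and the threshold is whatever the team chooses.

```python
import json


def quality_gate(report_json: str, threshold: float) -> int:
    """Map a JSON report to a process exit code based on macro_f1.

    Returns 0 when macro_f1 meets the threshold, 1 when it falls below,
    and 2 when the report has no usable macro_f1 value.
    """
    report = json.loads(report_json)
    # Assumption: macro_f1 sits at the top level of the report object.
    macro_f1 = report.get("macro_f1") if isinstance(report, dict) else None
    if isinstance(macro_f1, bool) or not isinstance(macro_f1, (int, float)):
        return 2
    return 0 if macro_f1 >= threshold else 1
```

A CI step would pass the returned value to `sys.exit`, so a run that executes cleanly but scores poorly still fails the build.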

Clarify the quickstart's exit-code semantics: it is an execution-health signal
(non-zero only on captured metric failures), NOT a quality gate — a SUT that
returns wrong-but-clean answers still exits 0. Add a Warning pointing to the
macro_f1 quality-threshold step in the CI gate guide.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lopadova

Contributor Author

@codex review — round 7 in b150984: the quickstart now states the exit code is an execution-health signal (not a quality gate) and links to the macro_f1 quality-threshold step. This matches EvalCommand::handle (fails only on totalFailures).

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b1509842d7

Comment thread docs-site/introduction.mdx Outdated
Comment on lines +13 to +14
of declared metrics, and a CI gate that exits non-zero the moment quality
drops.

P2: Do not imply quality drops fail the command automatically

In a cleanly executed run with low metric scores, EvalCommand::handle() still returns success because it only fails when totalFailures() > 0. This sentence tells readers the CI gate exits non-zero as soon as quality drops, so teams following the intro can rely on the Artisan exit code and miss regressions unless they add the documented macro_f1 threshold check.


Reword the introduction's CI-gate claim: the raw command exit code is an
execution-health signal and does not fail on merely low scores. The quality
gate comes from asserting a macro_f1 threshold (or a metric erroring); link to
the CI gate guide. Mirrors EvalCommand::handle (fails only on totalFailures).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lopadova

Contributor Author

@codex review — round 8 in c6689ff: reworded the introduction so the raw exit code is described as an execution-health signal, with the quality gate coming from a macro_f1 threshold assertion (linked). Swept the rest of the docs for the same phrasing.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c6689ff3d6



```php
use Padosoft\EvalHarness\Datasets\DatasetSample;
use Padosoft\EvalHarness\Facades\EvalFacade as Eval;
```

P2: Avoid aliasing the facade as Eval

When this snippet is copied into a PHP file, the import itself fails to parse because eval is a reserved language construct and cannot be used as a use ... as alias; the same pattern appears in the golden-datasets programmatic example. That makes the advertised facade examples unusable for readers, so they should call EvalFacade::... directly or use the registered global facade as \Eval::... instead.
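The parse failure can be checked without the package installed. The sketch below asks PHP's tokenizer how it lexes the alias in a `use ... as <alias>;` statement; `Vendor\Thing` is a placeholder namespace and `isUsableImportAlias` is a hypothetical helper written for this illustration.

```php
<?php
declare(strict_types=1);

/**
 * Returns true when $alias lexes as a plain identifier (T_STRING) in a
 * `use ... as <alias>;` statement, false when the lexer sees a keyword token.
 * Vendor\Thing is a placeholder namespace, not a real class.
 */
function isUsableImportAlias(string $alias): bool
{
    $tokens = token_get_all('<?php use Vendor\\Thing as ' . $alias . ';');

    foreach ($tokens as $token) {
        if (is_array($token) && $token[1] === $alias) {
            return $token[0] === T_STRING;
        }
    }

    return false;
}

var_dump(isUsableImportAlias('Eval'));    // lexed as the eval keyword (T_EVAL)
var_dump(isUsableImportAlias('Harness')); // lexed as an ordinary identifier
```

Keywords are matched case-insensitively, so capitalising the alias does not help; calling the facade by its class name sidesteps the problem.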


`Eval` is a reserved word in PHP (T_EVAL), so `use ...EvalFacade as Eval;` and
`Eval::...` are parse errors — the facade's own docblock notes this. Use
`EvalFacade::` directly across core-concepts, golden-datasets,
running-evaluations, and scoring-saved-outputs (code + prose + frontmatter), and
add the matching import where the eval-set snippet was missing it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lopadova

Contributor Author

@codex review — round 9 in 7b431c2: replaced the reserved-word Eval facade alias (a PHP parse error, per EvalFacade's own docblock) with EvalFacade:: directly across all four files, plus the missing import in the eval-set snippet.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7b431c2870


Comment thread docs-site/guides/online-monitoring.mdx Outdated
Comment on lines +84 to +85
reference exists, lean on metrics that judge the answer on its own merits
(e.g. refusal/helpfulness rubrics) rather than against a gold string.

P2: Don't recommend refusal rubrics for online samples

When users have no runtime reference and follow this advice by setting the online metric to the built-in refusal-quality, every sampled JudgeLiveSampleJob will fail: the job constructs DatasetSample without any metadata, while RefusalQualityMetric requires metadata.refusal_expected. That means no OnlineScore rows or drift alerts are produced for sampled traffic, so this note should either limit the advice to custom metrics / custom llm-as-judge prompts or document a supported way to provide refusal metadata.
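The failure mode can be sketched with stand-ins (PHP 8.1+). `SampleSketch` and `assertRefusalMetadata` below are hypothetical and only mirror the two facts stated in this comment: the online job builds a sample without metadata, and the refusal metric requires `metadata.refusal_expected`. They are not the package's classes.

```php
<?php
declare(strict_types=1);

/** Hypothetical stand-in for a dataset sample; not the package's DatasetSample. */
final class SampleSketch
{
    public function __construct(
        public readonly string $id,
        public readonly string $input,
        public readonly string $expected,
        public readonly array $metadata = [],
    ) {}
}

/** Hypothetical precondition mirroring "requires metadata.refusal_expected". */
function assertRefusalMetadata(SampleSketch $sample): void
{
    if (!array_key_exists('refusal_expected', $sample->metadata)) {
        throw new InvalidArgumentException(
            "sample {$sample->id}: metadata.refusal_expected is required"
        );
    }
}

// An online sample built from id/input/expected only, as the comment describes.
$online = new SampleSketch('live-1', 'How do I reset my password?', 'Use the reset link.');

try {
    assertRefusalMetadata($online);
    $outcome = 'scored';
} catch (InvalidArgumentException $e) {
    $outcome = 'failed: ' . $e->getMessage();
}

echo $outcome, "\n";
```

Because the metadata is absent for every sampled request, the failure is systematic rather than occasional, which is why no scores would be recorded.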


Online monitoring: stop recommending refusal rubrics for sampled traffic. The
JudgeLiveSampleJob builds DatasetSample from id/input/expected with no metadata,
but refusal-quality requires metadata.refusal_expected, so using it as the
online metric fails every sample (no OnlineScore rows, no drift alerts). Reword
to keep the llm-as-judge default (graded from input/output alone) and add a
Warning steering refusal scoring to the adversarial lane.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lopadova

Contributor Author

@codex review — round 10 in a32f375: online-monitoring note no longer recommends refusal-quality (the JudgeLiveSampleJob builds DatasetSample with no metadata, so it would fail every sample); keeps the llm-as-judge default and steers refusal scoring to the adversarial lane.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a32f37522b


Comment thread docs-site/guides/online-monitoring.mdx Outdated
Comment on lines +84 to +86
reference exists, keep the default `online.metric` of `llm-as-judge` (or a
custom `llm-as-judge` prompt that grades the answer on its own merits), which
scores from the input/output alone.

P2: Clarify that the default judge still needs a reference

In the no-reference case this guidance will produce misleading online scores: JudgeLiveSampleJob passes the supplied expected into DatasetSample, and the default llm-as-judge prompt grades ACTUAL against EXPECTED rather than input/output alone. If operators pass an empty or placeholder reference based on this note, online drift monitoring will judge answers against that empty placeholder; only a custom prompt that explicitly ignores {expected} supports reference-less scoring.
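To make the distinction concrete, here is a sketch of two prompt templates rendered by a hypothetical `renderJudgePrompt` helper. Only the `{expected}` placeholder is named in this thread; `{input}`, `{actual}`, and the template wording are assumptions, and the package's real default prompt may read differently.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical renderer: substitutes {input}, {actual}, and {expected}.
 * Only {expected} is named in this thread; the other placeholders are assumed.
 */
function renderJudgePrompt(string $template, string $input, string $actual, string $expected): string
{
    return strtr($template, [
        '{input}' => $input,
        '{actual}' => $actual,
        '{expected}' => $expected,
    ]);
}

// Reference-based: an empty reference leaves the judge nothing real to compare against.
$referenceBased = "QUESTION: {input}\nEXPECTED: {expected}\nACTUAL: {actual}\nGrade ACTUAL against EXPECTED.";

// Reference-less: never mentions {expected}, so no reference is needed.
$referenceLess = "QUESTION: {input}\nANSWER: {actual}\nGrade the answer on its own merits.";

echo renderJudgePrompt($referenceBased, 'What is 2+2?', '4', ''), "\n\n";
echo renderJudgePrompt($referenceLess, 'What is 2+2?', '4', ''), "\n";
```

The first rendering contains a blank `EXPECTED:` line, which is the misleading case the comment warns about; the second never depends on the reference at all.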


Comment thread docs-site/reference/report-api.mdx Outdated
Comment on lines +31 to +33
## The response envelope

Every JSON response carries a version discriminator:

P2: Limit the schema envelope claim to success responses

This overstates the report API contract for error paths: several controllers throw NotFoundHttpException, UnprocessableEntityHttpException, or ServiceUnavailableHttpException for missing artifacts, malformed report JSON, and storage/cache failures, and Laravel renders those as standard error JSON rather than this schema_version envelope. UI clients that branch on schema_version for every response will break on those documented 404/422/503 cases, so this should either say successful JSON responses are enveloped or document the unversioned error shape.
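A client that follows this advice might look like the sketch below. `readEnvelope` is a hypothetical helper, and the `data` key and the `example.v1` version string are assumptions for illustration; only `schema_version` and the `message` error field come from this thread.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical client helper: branch on HTTP status before touching the envelope.
 * Returns the enveloped payload on 2xx, throws otherwise.
 */
function readEnvelope(int $status, array $body): mixed
{
    if ($status < 200 || $status >= 300) {
        // Error bodies may lack schema_version, so do not dereference it here.
        $message = is_string($body['message'] ?? null) ? $body['message'] : 'unknown error';
        throw new RuntimeException("HTTP {$status}: {$message}");
    }

    if (!isset($body['schema_version'])) {
        throw new UnexpectedValueException('2xx response without schema_version');
    }

    return $body['data'] ?? null;
}

// Placeholder payloads; the version string is invented for this example.
$payload = readEnvelope(200, ['schema_version' => 'example.v1', 'data' => ['runs' => []]]);

try {
    readEnvelope(404, ['message' => 'Report not found.']);
    $errorSeen = '';
} catch (RuntimeException $e) {
    $errorSeen = $e->getMessage();
}

echo $errorSeen, "\n";
```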


- online-monitoring: correct the round-10 wording — the default llm-as-judge
  grades ACTUAL against EXPECTED, so it needs a real reference (an empty/placeholder
  makes scores meaningless). Reference-less online scoring requires a custom
  llm-as-judge prompt_template that ignores {expected}
- report-api: limit the schema_version envelope claim to successful (2xx)
  responses; document that 404/422/503 are standard Laravel error JSON without
  the envelope, so clients must branch on HTTP status first

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lopadova

Contributor Author

@codex review — round 11 in b22418b: online-monitoring now states the default llm-as-judge grades against EXPECTED (needs a real reference; reference-less requires a custom prompt ignoring {expected}); report-api limits the schema_version envelope to 2xx and documents 404/422/503 as standard Laravel error JSON.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b22418b9ec


Comment on lines +65 to +66
**`SampleInvocation`** DTO (`sample_id` + `input`) and the worker reconstructs the
rest from the dataset registry. This avoids serialization failures and stops the

P2: Document SampleInvocation as id/input only

For lazy-parallel runs, the worker does not reconstruct a DatasetSample from the registry: EvaluateSampleJob::handle() resolves the runner and passes only the queued SampleInvocation (whose public fields are id and input), while the producer scores the returned outputs later against its original dataset. As written, this tells queue users that expected output/metadata will be available in workers, which will fail for SampleRunner implementations that rely on that state.
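The division of labour described here can be sketched with stand-ins (PHP 8.1+). `InvocationSketch`, `runInWorker`, and `scoreInProducer` are hypothetical names, and exact-match scoring is used only to keep the example short; none of this is the package's code.

```php
<?php
declare(strict_types=1);

/** Hypothetical mirror of the queued payload: id and input only. */
final class InvocationSketch
{
    public function __construct(
        public readonly string $id,
        public readonly string $input,
    ) {}
}

/** Worker side: the runner sees only the invocation, never expected output or metadata. */
function runInWorker(InvocationSketch $invocation): string
{
    return strtoupper($invocation->input); // placeholder system under test
}

/** Producer side: score returned outputs against the original dataset, keyed by id. */
function scoreInProducer(array $dataset, array $outputsById): array
{
    $scores = [];
    foreach ($dataset as $sample) {
        $actual = $outputsById[$sample['id']] ?? null;
        $scores[$sample['id']] = $actual === $sample['expected'] ? 1.0 : 0.0;
    }

    return $scores;
}

$dataset = [
    ['id' => 's1', 'input' => 'ok', 'expected' => 'OK'],
    ['id' => 's2', 'input' => 'no', 'expected' => 'yes'],
];

$outputs = [];
foreach ($dataset as $sample) {
    $invocation = new InvocationSketch($sample['id'], $sample['input']);
    $outputs[$invocation->id] = runInWorker($invocation);
}

print_r(scoreInProducer($dataset, $outputs));
```

A runner that needs the expected output or metadata has nothing to read from in `runInWorker`, which is the failure the comment describes for queue users.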


Comment thread docs-site/reference/report-api.mdx Outdated
Comment on lines +45 to +50
**Error responses are not enveloped.** Missing artifacts/ids (`404`), malformed
report JSON on a JSON-only endpoint (`422`), and storage/cache failures (`503`)
are thrown as standard Laravel HTTP exceptions and rendered as Laravel's default
error JSON (e.g. `{ "message": "..." }`) — **without** a `schema_version`. A
client should branch on the **HTTP status first**, and only read
`schema_version` / `data` on a `2xx` response.

P2: Exclude disabled manifest 404s from the error-envelope rule

When adversarial manifest discovery is disabled (eval-harness.adversarial.manifests.disk is null), ManifestController::discoveryNotConfigured() returns a 404 JSON body containing schema_version, error: discovery_not_configured, and message, not Laravel's default { "message": "..." } error shape. This blanket statement will make API clients ignore the documented disabled-surface signal or treat the existing 404 envelope as a contract violation.


- evaluation-pipeline + decisions: correct the SampleInvocation description.
  Fields are id + input (not sample_id); the worker hands ONLY the invocation to
  the runner and does NOT reconstruct the full DatasetSample, so
  expected_output/metadata are unavailable in workers. Scoring runs in the
  producer against the original dataset. Add a warning that runners needing
  expected/metadata are serial-only
- report-api: carve out the disabled-manifest 404 from the unenveloped-error
  rule — it returns an enveloped {schema_version, error: discovery_not_configured,
  message} body, a deliberate "feature off" signal

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lopadova

Contributor Author

@codex review — round 12 in aab7456: SampleInvocation documented as id+input only with no worker-side reconstruction (expected/metadata unavailable in workers; scoring in the producer), and the disabled-manifest 404 carved out as an enveloped discovery_not_configured signal.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aab7456efa


Comment thread docs-site/reference/report-api.mdx Outdated
Comment on lines +57 to +58
— a structured "feature off" signal, not a generic error. Treat a `404` on the
manifest endpoints as "discovery not configured" rather than "missing".

P2: Check the manifest error body before treating 404 as disabled

When adversarial.manifests.disk is configured but a UI asks for a manifest name that doesn't exist, ManifestController::show() still returns a normal NotFoundHttpException 404 (the repository throws EvalRunException for the missing file), not the enveloped discovery_not_configured body. This sentence tells clients to treat every manifest 404 as feature-off, so they'd hide genuinely missing links; document checking error === 'discovery_not_configured' instead.
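The check suggested here could be sketched as follows. `classifyManifestResponse` and its return labels are hypothetical; the only values taken from this thread are the `404` status and the `error` field equal to `discovery_not_configured`.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical classifier for manifest-endpoint responses.
 * Returns 'ok', 'feature_off', 'missing', or 'error'.
 */
function classifyManifestResponse(int $status, array $body): string
{
    if ($status >= 200 && $status < 300) {
        return 'ok';
    }

    if ($status === 404) {
        // Only the enveloped body with this error code means discovery is disabled.
        return ($body['error'] ?? null) === 'discovery_not_configured'
            ? 'feature_off'
            : 'missing';
    }

    return 'error';
}

echo classifyManifestResponse(404, ['error' => 'discovery_not_configured']), "\n";
echo classifyManifestResponse(404, ['message' => 'Not Found']), "\n";
```

With this split, a UI can hide the manifest surface when the feature is off and still show a broken-link state for a name that does not exist.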


Comment thread docs-site/reference/report-api.mdx Outdated

| Endpoint | Purpose |
| --- | --- |
| `GET /batches/live` | Active lazy-parallel batch ids + status. |

P2: Don't promise status in the live batch list

For a client building the live-batches view from this catalog, GET /batches/live does not include any per-batch status: BatchLiveController::live() renders BatchLiveResource, which only emits each batch's id and expires_at; status is returned by /batches/{id}/progress. As written, UI clients will dereference a field that is never sent unless they make the progress call, so this row should either remove + status or say that status requires the progress endpoint.
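A client that wants status in its live view therefore needs a second call per batch. The sketch below joins the two sources; `liveBatchesWithStatus` is a hypothetical helper, the progress lookup is injected as a callable so no HTTP client is assumed, and the `running` value and the exact payload shapes are placeholders.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical join: list items carry id + expires_at only, so status is
 * looked up per batch through $fetchProgress (e.g. a call to the progress endpoint).
 */
function liveBatchesWithStatus(array $liveItems, callable $fetchProgress): array
{
    $rows = [];
    foreach ($liveItems as $item) {
        $progress = $fetchProgress($item['id']);
        $rows[] = [
            'id' => $item['id'],
            'expires_at' => $item['expires_at'],
            // Never read status from the list item; it is not sent there.
            'status' => $progress['status'] ?? 'unknown',
        ];
    }

    return $rows;
}

$live = [['id' => 'batch-1', 'expires_at' => '2026-06-17T07:00:00Z']];
$fakeProgress = static fn (string $id): array => ['status' => 'running'];

print_r(liveBatchesWithStatus($live, $fakeProgress));
```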


report-api refinements:
- manifest 404s: distinguish the enveloped discovery_not_configured (feature
  off) from a plain NotFoundHttpException (configured but the named manifest
  doesn't exist) — clients must check the error field, not treat every 404 as
  feature-off
- /batches/live emits id + expires_at only; per-batch status comes from
  /batches/{id}/progress. Correct the catalog rows so clients don't dereference a
  status field the list endpoint never sends

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lopadova

Contributor Author

@codex review — round 13 in 59c4b7a: manifest 404s now distinguish discovery_not_configured (feature off) from a plain missing-name 404, and /batches/live is documented as id + expires_at only (status comes from /batches/{id}/progress).

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 59c4b7a46f


Comment thread docs-site/metrics/overview.mdx Outdated
Comment on lines +8 to +9

Every metric implements one method. It receives a sample and the SUT's actual

P3: Describe the Metric contract accurately

The actual Padosoft\EvalHarness\Metrics\Metric interface requires both name() and score(), so saying every metric implements one method misleads users writing custom metrics. If they follow this prose and only implement score(), their class cannot satisfy the interface and will not be usable through the resolver; please update this sentence and the similar “single-method” wording elsewhere to reflect the two-method contract.
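For a custom-metric author, the two-method shape looks roughly like the sketch below. `MetricSketch` is a hypothetical stand-in declared locally so the example runs on its own; the real interface's parameter and return types are not shown in this thread and may differ.

```php
<?php
declare(strict_types=1);

/** Hypothetical stand-in for the package's Metric interface; real signatures may differ. */
interface MetricSketch
{
    public function name(): string;

    /** @param array{expected: string} $sample */
    public function score(array $sample, string $actual): float;
}

/** Example custom metric: both methods are required to satisfy the interface. */
final class CaseInsensitiveMatch implements MetricSketch
{
    public function name(): string
    {
        return 'case-insensitive-match';
    }

    public function score(array $sample, string $actual): float
    {
        return strcasecmp(trim($sample['expected']), trim($actual)) === 0 ? 1.0 : 0.0;
    }
}

$metric = new CaseInsensitiveMatch();
printf("%s = %.1f\n", $metric->name(), $metric->score(['expected' => 'Paris'], ' paris '));
```

Dropping `name()` from the class makes PHP reject it at load time, which is the practical consequence of the "one method" wording for anyone who follows it literally.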



```mermaid
flowchart LR
B[LazyParallelBatch] -->|every N done| PR["BatchProgressReporter::report()"]
```

P3: Use the real checkpoint callback name

The BatchProgressReporter interface exposes reportCheckpoint(), not report(). A reader implementing a reporter from this diagram would end up with the wrong callback name and fail to implement the package contract, so the label should point to BatchProgressReporter::reportCheckpoint().


- Metric is a two-method interface (name() + score()), not one method — fix the
  "one method"/"single-method" wording in metrics/overview, core-concepts, and
  metrics-catalog so custom-metric authors implement both
- BatchProgressReporter's callback is reportCheckpoint(), not report() — fix the
  batch-execution diagram label

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lopadova

Contributor Author

@codex review — round 14 in 520d97b: Metric documented as a two-method interface (name() + score()) across overview/core-concepts/catalog, and the batch-execution diagram now labels the callback reportCheckpoint().

@chatgpt-codex-connector


Codex Review: Didn't find any major issues. You're on a roll.

Reviewed commit: 520d97b345


@lopadova lopadova merged commit 69e486c into main Jun 17, 2026
6 checks passed
@lopadova lopadova deleted the docs/mintlify-doc-site branch June 17, 2026 06:06