14 changes: 14 additions & 0 deletions README.md
@@ -12,8 +12,22 @@

![eval-harness report banner](https://raw.githubusercontent.com/padosoft/eval-harness/main/resources/banner.png)

## Official Documentation

📚 **Full documentation is available at
[doc.eval-harness.padosoft.com](https://doc.eval-harness.padosoft.com/).**

The documentation site covers everything in depth: a five-minute quickstart, the
fifteen built-in metrics with their underlying theory and formulas, guides for
CI gating, judge calibration, online monitoring and adversarial testing, the
batch/Horizon operations model, the architecture and decision records, and the
full CLI, configuration, and report-API reference.

---

## Table of Contents

0. [Official Documentation](https://doc.eval-harness.padosoft.com/)
1. [Why eval-harness?](#why-eval-harness)
2. [Design rationale](#design-rationale)
3. [Features](#features)
49 changes: 49 additions & 0 deletions docs-site/README.md
@@ -0,0 +1,49 @@
# eval-harness documentation site (Mintlify)

This folder is the **public documentation site** for
[`padosoft/eval-harness`](https://github.com/padosoft/eval-harness), published
with [Mintlify](https://mintlify.com) at
**[doc.eval-harness.padosoft.com](https://doc.eval-harness.padosoft.com/)**.

It is intentionally separate from the internal engineering docs in
[`/docs`](../docs) (roadmap, rules, progress, lessons, contract specs):
`/docs-site` is the curated, end-user-facing reference, authored at
senior-architect / academic depth.

## Local preview

```bash
npm i -g mint # one-time: the Mintlify CLI
cd docs-site
mint dev # http://localhost:3000
```

`mint dev` renders the site from `docs.json` + the `*.mdx` pages and reports
broken links.

## Layout

- `docs.json` — site config + the groups-based navigation. **Every page must be
registered here**, and Mintlify errors on a nav entry whose `.mdx` file does
not exist.
- `*.mdx`, `guides/*.mdx`, `metrics/*.mdx`, `best-practices/*.mdx`,
`operations/*.mdx`, `architecture/*.mdx`, `reference/*.mdx` — one file per
page.
- `favicon.svg` — site favicon.
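
The registration rule can be checked locally before pushing. The following is a minimal Python sketch, not part of Mintlify or of this package. It assumes `docs.json` keeps its pages under `navigation.groups[].pages` as extension-less paths, with optional nested groups; that shape is an assumption, so adjust the traversal to the real config. `iter_pages` and `missing_pages` are hypothetical helper names.

```python
import json
from pathlib import Path


def iter_pages(node):
    """Yield every page path found anywhere in a navigation tree."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, list):
        for item in node:
            yield from iter_pages(item)
    elif isinstance(node, dict):
        # Assumed shape: a group carries a "pages" list that may nest more groups.
        for key in ("groups", "pages"):
            yield from iter_pages(node.get(key, []))


def missing_pages(site_root):
    """Return nav entries in docs.json that have no matching .mdx file."""
    root = Path(site_root)
    config = json.loads((root / "docs.json").read_text(encoding="utf-8"))
    pages = set(iter_pages(config.get("navigation", {})))
    return sorted(p for p in pages if not (root / f"{p}.mdx").is_file())
```

Calling `missing_pages("docs-site")` from the repository root should return an empty list when every registered page has a file.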

## Authoring standard

Pages follow the deep-doc template: **motivation → theory (with formulas where
relevant) → design with a Mermaid diagram → data/contract model → ADR-style
rationale → worked example → gotchas**. New capabilities and README changes
should ship their matching deep page here. Components used: `<Note>`,
`<Warning>`, `<Tip>`, `<Steps>`, `<CardGroup>`/`<Card>`,
`<AccordionGroup>`/`<Accordion>`, and Mermaid `flowchart` / `sequenceDiagram`
blocks.

## Deployment

Connect the Mintlify GitHub App to this repository with the content directory
set to `docs-site/`. Every push to `main` that touches `docs-site/`
auto-deploys to the live site at
[doc.eval-harness.padosoft.com](https://doc.eval-harness.padosoft.com/).
207 changes: 207 additions & 0 deletions docs-site/architecture/decisions.mdx
@@ -0,0 +1,207 @@
---
title: "Architecture decisions"
description: "The curated narrative of the load-bearing decisions — YAML-not-DB datasets, raw Http:: over vendor SDKs, failures-as-data, queue-serializable SUTs, manifest-backed baselines, headless API, and the standalone-agnostic rule — each as problem, decision, and consequence."
icon: "scroll"
---

## Motivation

Code shows *what* the system does; it rarely shows *why*. These are the
load-bearing decisions that shaped eval-harness — each a problem, the choice
made, and the trade-off accepted. They are grouped into the arcs they belong to.
Do not unwind a non-obvious one without understanding what it holds up.

## The data arc: knowledge as reviewable, local-first state

<AccordionGroup>
<Accordion title="Datasets are YAML, never database rows">
**Problem.** Evaluation datasets must be reviewable, diffable, and survive
database wipes. A DB-backed dataset is invisible in a pull request and
couples evaluation to schema.

**Decision.** Golden datasets live in `eval/golden/*.yml` and load through a
strict-schema loader; the package never persists them. They are versioned
with your code and reviewed like code.

**Consequence.** No UI-driven dataset curation, and operators need YAML
fluency — accepted in exchange for PR-reviewable, diffable, durable datasets.
</Accordion>
<Accordion title="The report is human-readable and machine-versioned">
**Problem.** A report must serve both a human reading a PR and a dashboard
that is wired in once and should keep working as the report evolves.

**Decision.** Two renderers over one immutable `EvalReport` — Markdown for
humans, versioned JSON (`eval-harness.report.v1`) for machines — evolving
additively.

**Consequence.** Consumers branch on `schema_version` and trust fields not to
change meaning within a major version. See [the report contract](/architecture/report-contract).
</Accordion>
<Accordion title="Minimal schema footprint — one auto-loaded migration">
**Problem.** A package that mutates the host schema on install is intrusive.

**Decision.** Datasets are YAML and results are JSON on a disk. The package
ships exactly **one** migration — `eval_harness_online_scores` for the
optional online-monitoring feature — auto-loaded by the service provider so it
is available to host apps and to `RefreshDatabase` in tests.

**Consequence.** `php artisan migrate` creates that single table; the feature
is off by default so it stays empty until you opt in, and no existing table is
touched.
</Accordion>
</AccordionGroup>
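
A consumer of the versioned JSON report might branch on the schema version as sketched below. This is an illustrative Python sketch, not code shipped by the package. It assumes the version string `eval-harness.report.v1` sits under a top-level `schema_version` key; `load_report` is a hypothetical helper name.

```python
import json

SUPPORTED_SCHEMA = "eval-harness.report.v1"


def load_report(raw):
    """Parse a JSON report and refuse any schema version we do not know."""
    report = json.loads(raw)
    version = report.get("schema_version")
    if version != SUPPORTED_SCHEMA:
        raise ValueError(f"unsupported report schema: {version!r}")
    # Additive evolution: unknown extra fields are simply carried along.
    return report
```

Because the schema evolves additively, a consumer written this way keeps working when new fields appear and fails loudly only when the version itself changes.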

## The provider arc: control and testability over convenience

<AccordionGroup>
<Accordion title="Raw Http:: — no vendor AI SDKs">
**Problem.** Vendor SDKs hide auth, retries, timeouts, and response parsing,
and make deterministic offline tests hard.

**Decision.** Every embedding and judge call goes through Laravel's `Http::`
facade against an OpenAI-compatible endpoint. Tests substitute `Http::fake()`.

**Consequence.** Swapping providers is a config change, not a refactor, and
the whole external surface is fakeable — at the cost of managing endpoints and
auth yourself.
</Accordion>
<Accordion title="Deterministic judges, narrow retries">
**Problem.** A judge that varies run-to-run is useless as a gate; aggressive
retries can mask real failures.

**Decision.** The judge pins `temperature=0`, `seed=42`, `response_format=json_object`
and rejects malformed JSON loudly. Retries cover only connection failures,
429, and 5xx — never a malformed 200.

**Consequence.** Reproducible verdicts and a fail-closed posture on contract
violations.
</Accordion>
</AccordionGroup>
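
The narrow retry rule and the fail-closed parse can be sketched as follows. This is a language-neutral illustration in Python, not the package's PHP implementation; `should_retry`, `parse_verdict`, and `JUDGE_REQUEST_OPTIONS` are hypothetical names, and the `response_format` shape follows the OpenAI-compatible chat API.

```python
import json

# Pinned request options described above (OpenAI-compatible field names).
JUDGE_REQUEST_OPTIONS = {
    "temperature": 0,
    "seed": 42,
    "response_format": {"type": "json_object"},
}


def should_retry(status=None, connection_failed=False):
    """Retry only transport failures, 429, and 5xx responses."""
    if connection_failed:
        return True
    if status is None:
        return False
    return status == 429 or 500 <= status <= 599


def parse_verdict(status, body):
    """Fail closed: a 200 with a malformed body is an error, never a retry."""
    if status != 200:
        raise RuntimeError(f"judge returned HTTP {status}")
    try:
        verdict = json.loads(body)
    except json.JSONDecodeError as exc:
        raise ValueError("judge returned malformed JSON") from exc
    if not isinstance(verdict, dict):
        raise ValueError("judge response is not a JSON object")
    return verdict
```

Keeping the retry predicate separate from parsing makes the policy easy to test: a malformed 200 never reaches `should_retry` with a retryable status.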

## The execution arc: failures as data, queues done safely

<AccordionGroup>
<Accordion title="Metric exceptions are captured by default">
**Problem.** A timeout on sample 47 should not erase the macro-F1 across 199
valid samples.

**Decision.** Every metric exception is recorded as a `SampleFailure` against
`(sample, metric)` and surfaced in the report. The exit code still reflects
captured failures; strict lanes can opt into `raise_exceptions` to fail fast.

**Consequence.** Operators investigate one case instead of re-running a long
suite, while CI still fails on any captured failure.
</Accordion>
<Accordion title="Lazy-parallel preserves positional ordering">
**Problem.** Queue jobs finish out of order, but a report must be
deterministic and comparable.

**Decision.** Outputs are written to a cache-backed result store keyed by
batch id + sample index and reassembled in dataset order.

**Consequence.** A queue-backed run produces a report identical to a serial
one — at the cost of a shared cache store and Horizon sizing discipline.
</Accordion>
<Accordion title="A SampleInvocation DTO carries queue work">
**Problem.** A full `DatasetSample` may hold objects, resources, or invalid
UTF-8 that won't serialize onto a queue.

**Decision.** Jobs carry a minimal `SampleInvocation` (fields `id` + `input`
only); the worker hands just that to the runner and does **not** reconstruct
the full sample, so `expected_output` / `metadata` are unavailable in the
worker (scoring runs later in the producer against the original dataset).
Queued SUTs must therefore be container-resolvable `SampleRunner` classes that
need only `input`.

**Consequence.** No serialization failures and no contract creep — closures,
caller-specific runner state, and runners that need expected/metadata stay
serial-only.
</Accordion>
</AccordionGroup>
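
Two of the ideas above, failures captured as data and order-preserving reassembly, can be sketched in a few lines. This is a language-neutral Python illustration, not the package's PHP code. The `SampleFailure` field names, the plain `dict` standing in for the cache-backed result store, and the helper names are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleFailure:
    sample_id: str
    metric: str
    error: str


def score_all(samples, metrics):
    """Score every (sample, metric) pair; record exceptions instead of aborting."""
    scores, failures = {}, []
    for sample in samples:
        for name, metric in metrics.items():
            try:
                scores[(sample["id"], name)] = metric(sample)
            except Exception as exc:  # captured as data and surfaced later
                failures.append(SampleFailure(sample["id"], name, str(exc)))
    return scores, failures


def exit_code(failures):
    """Any captured failure still fails the run."""
    return 1 if failures else 0


def reassemble(store, batch_id, sample_count):
    """Read outputs back in dataset order, whatever order the jobs finished in."""
    return [store[(batch_id, index)] for index in range(sample_count)]
```

One failing sample leaves the other scores intact, and keying the store by `(batch_id, index)` means completion order never affects the assembled output.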

## The safety arc: opt-in red-teaming, clean baselines

<AccordionGroup>
<Accordion title="Adversarial coverage is opt-in">
**Problem.** A bundled red-team daemon is intrusive and hard to govern.

**Decision.** The adversarial lane is an explicit command with no background
process; you schedule it from Laravel Scheduler or CI cron.

**Consequence.** Full host control and transparency — at the cost of
remembering to schedule it.
</Accordion>
<Accordion title="Broken runs can never seed a baseline">
**Problem.** A drift gate is only sound if its baselines are clean; a
regression must not silently redefine "normal".

**Decision.** Manifest baselines advance only on failure-free, gate-clean
runs, keyed by report-schema / dataset / metric / category / sample-count.
Writes are lock-serialized and atomic.

**Consequence.** A `missing-baseline` status on fresh slices is expected and
correct; you cannot fake a baseline by hand without breaking the guarantee.
</Accordion>
</AccordionGroup>
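
The clean-baseline rule can be sketched as a guarded, atomic manifest write. This is a Python illustration under stated assumptions, not the package's implementation: the manifest is assumed to be a flat JSON object, the run dictionary's field names are hypothetical, and a `threading.Lock` stands in for whatever cross-process lock the real code uses.

```python
import json
import os
import tempfile
import threading

_manifest_lock = threading.Lock()  # in-process stand-in for the real write lock


def baseline_key(run):
    """Key a baseline by the slice it describes."""
    fields = ("schema", "dataset", "metric", "category", "sample_count")
    return "/".join(str(run[field]) for field in fields)


def advance_baseline(manifest_path, run):
    """Record the run's score as the baseline only if the run is clean."""
    if run["failures"] or not run["gates_passed"]:
        return False  # broken or gate-failing runs never redefine "normal"
    with _manifest_lock:
        manifest = {}
        if os.path.exists(manifest_path):
            with open(manifest_path, encoding="utf-8") as handle:
                manifest = json.load(handle)
        manifest[baseline_key(run)] = run["score"]
        directory = os.path.dirname(os.path.abspath(manifest_path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(manifest, handle)
        os.replace(tmp_path, manifest_path)  # atomic swap on the same filesystem
    return True
```

A slice whose key is absent from the manifest corresponds to the `missing-baseline` status described above: nothing clean has been recorded for it yet.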

## The boundary arc: headless and self-contained

<AccordionGroup>
<Accordion title="The report API is read-only and bundles no auth">
**Problem.** Shipping authentication in a library forces a security model on
every host.

**Decision.** The API is opt-in, read-only, and mounted behind the host app's
own middleware; it is disabled by default.

**Consequence.** Hosts wire their existing admin auth — the package never
leaks artifacts by default.
</Accordion>
<Accordion title="The package depends on none of its consumers">
**Problem.** Tooling extracted from an application tends to keep hidden ties
back to it.

**Decision.** An architecture test walks `src/` and fails on any reference to a
consumer's internals or a sibling package.

**Consequence.** A genuinely reusable evaluation substrate — `composer require`
behaves identically with or without AskMyDocs or any sibling app.
</Accordion>
</AccordionGroup>
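
The standalone-agnostic check amounts to scanning the source tree for forbidden references. The Python sketch below illustrates the idea only; the real architecture test lives in the package's own test suite and is not reproduced here. The forbidden prefixes are hypothetical examples, and `forbidden_references` is a hypothetical helper name.

```python
from pathlib import Path

# Hypothetical examples of consumer namespaces the package must not import.
FORBIDDEN = ("use App\\", "use AskMyDocs\\")


def forbidden_references(src_root, needles=FORBIDDEN):
    """Return (file, line number, needle) for every forbidden reference found."""
    hits = []
    for path in sorted(Path(src_root).rglob("*.php")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            for needle in needles:
                if needle in line:
                    hits.append((str(path), number, needle))
    return hits
```

A test built on this would assert that `forbidden_references("src")` is empty, so any new tie to a consumer fails the build immediately.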

## How the arcs depend on each other

```mermaid
flowchart LR
YAML[YAML datasets] --> REP[versioned report]
HTTP["raw Http::"] --> DET[deterministic judges]
DET --> CAP[failures as data]
CAP --> LZP[order-preserving lazy-parallel]
LZP --> DTO[SampleInvocation DTO]
REP --> API[headless read-only API]
CAP --> ADV[opt-in adversarial]
ADV --> BASE[clean manifest baselines]
AGN[standalone-agnostic] --> API
AGN --> YAML
```

## Decision rationale (meta)

- **Why record decisions at all?** A wrong assumption baked into a quick fix
propagates into every later PR. The durable *why* keeps future changes from
unwinding deliberate constraints by accident.
- **Don't unwind a load-bearing choice silently.** The raw-`Http::` rule,
failures-as-data, the queue-serializable SUT contract, clean-baseline-only
manifests, and the standalone-agnostic guarantee each hold up something above
them. The full editorial record lives in the repository's `docs/` — see
[`CONTRACT_STABILITY.md`](https://github.com/padosoft/eval-harness/blob/main/docs/CONTRACT_STABILITY.md)
and [`LESSON.md`](https://github.com/padosoft/eval-harness/blob/main/docs/LESSON.md).

<CardGroup cols={2}>
<Card title="Architecture overview" icon="sitemap" href="/architecture/overview">
The component map these decisions produced.
</Card>
<Card title="The evaluation pipeline" icon="diagram-project" href="/architecture/evaluation-pipeline">
The run lifecycle, stage by stage.
</Card>
</CardGroup>