Merged
1 change: 1 addition & 0 deletions docs/getting-started/index.md
@@ -31,3 +31,4 @@ uv run python -m every_eval_ever --help
- See [Data Structure](../data-structure/)
- See [Eval Converters](../eval-converters/)
- See [Contributing](../contributing/)
- See [HF Community Evals](../hf-community-evals/)
132 changes: 132 additions & 0 deletions docs/hf-community-evals/index.md
@@ -0,0 +1,132 @@
---
layout: default
title: HF Community Evals
nav_order: 6
---

# EEE -> HF Community Evals

Built and maintained by Harsha Nelaturu · EvalEval Coalition · June 2026.

Use `tools/hf-community-evals/community_evals_converter.py` to review one EEE datastore collection, generate
local HF Community Evals YAML previews, audit existing scores/open PRs, and
optionally open PRs after explicit approval.

## Quick Start

Use `uv run` for all commands.

```bash
uv run tools/hf-community-evals/community_evals_converter.py MMLU-Pro \
  --datastore evaleval/EEE_datastore@main
```

The run caches its results for this collection. To ignore the cache and force a fresh rebuild, add `--force`:

```bash
uv run tools/hf-community-evals/community_evals_converter.py MMLU-Pro \
  --datastore evaleval/EEE_datastore@main \
  --force
```

The positional argument is a collection stem. It must resolve exactly to:

```text
https://huggingface.co/datasets/evaleval/EEE_datastore/flat/indexes/by_collection/<collection>.jsonl
```
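
As a rough illustration of how a collection stem maps onto that index URL, the sketch below builds the URL from a datastore spec such as `evaleval/EEE_datastore@main`. The helper name `collection_index_url` is hypothetical and is not taken from the converter. The path layout is copied from the template above, and the sketch ignores the `@main` revision because the template does not show one.

```python
def collection_index_url(datastore: str, collection: str) -> str:
    """Build the by_collection index URL for a collection stem (illustrative only)."""
    # Split an optional "@revision" suffix off the dataset repo id.
    repo_id, _, _revision = datastore.partition("@")
    if not repo_id or not collection:
        raise ValueError("datastore and collection must be non-empty")
    # Path layout copied from the URL template in this document.
    return (
        f"https://huggingface.co/datasets/{repo_id}"
        f"/flat/indexes/by_collection/{collection}.jsonl"
    )


print(collection_index_url("evaleval/EEE_datastore@main", "MMLU-Pro"))
```

Because the stem must resolve exactly, a stem with different casing or spelling yields a different URL and should be expected to fail rather than fall back to a near match.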

## Outputs

For `MMLU-Pro`, outputs are written under:

```text
outputs/community_evals_converter_MMLU-Pro/
```

Important output files:

- `manifest.json`: converted candidate records plus skipped/error metadata.
- `review.json`: full review result, duplicate audit findings, audit errors,
and PR readiness.
- `yamls/<owner>/<model>/.eval_results/<benchmark>.yaml`: local YAML previews.

`outputs/` is ignored by git. Use these files for inspection, not as merge
inputs.
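
To make the layout concrete, here is a minimal sketch that assembles the preview path for one model and benchmark. The function `preview_yaml_path` is a hypothetical helper, not part of the tool, and the model id is borrowed from the test fixture in this PR purely as an example (the real HF repo casing may differ).

```python
from pathlib import Path


def preview_yaml_path(
    collection: str,
    model_repo: str,
    benchmark: str,
    root: Path = Path("outputs"),
) -> Path:
    """Return the local YAML preview path described above (illustrative only)."""
    # A model repo id has the form "<owner>/<model>".
    owner, model = model_repo.split("/", 1)
    return (
        root
        / f"community_evals_converter_{collection}"
        / "yamls"
        / owner
        / model
        / ".eval_results"
        / f"{benchmark}.yaml"
    )


print(preview_yaml_path("MMLU-Pro", "01-ai/yi-1.5-34b-chat", "mmlu_pro").as_posix())
```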

## Review Behavior

The tool:

- downloads the collection JSONL and referenced aggregate objects from the HF
datastore;
- validates object hashes and optional sizes;
- scans each aggregate record for supported HF benchmark datasets;
- writes YAML entries using the datastore object HF URL as `source.url`;
- keeps flat datastore provenance, including instance-level references when
present;
- checks model repo existence on Hugging Face;
- audits every existing `.eval_results/*.yaml` file on model `main`;
- audits changed `.eval_results/*.yaml` files in open PR refs;
- compares by dataset/task content, not YAML filename.
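
The hash and size validation step can be sketched as follows. This is an assumption about how such a check might look, not the converter's actual code: `validate_object` is a hypothetical name, and the `sha256` and `size_bytes` keys mirror the fields of the index record in the test fixture, with `size_bytes` treated as optional.

```python
import hashlib


def validate_object(payload: bytes, record: dict) -> None:
    """Raise ValueError if payload does not match the index record (illustrative only)."""
    digest = hashlib.sha256(payload).hexdigest()
    if digest != record["sha256"]:
        raise ValueError(f"sha256 mismatch: expected {record['sha256']}, got {digest}")
    # The size is optional, so only check it when the record carries one.
    expected_size = record.get("size_bytes")
    if expected_size is not None and len(payload) != expected_size:
        raise ValueError(f"size mismatch: expected {expected_size}, got {len(payload)}")


# Demo with a synthetic payload and a record derived from it.
payload = b'{"demo": true}'
record = {"sha256": hashlib.sha256(payload).hexdigest(), "size_bytes": len(payload)}
validate_object(payload, record)  # passes silently
```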

Supported benchmarks in this workflow are:

- `mmlu_pro`
- `gpqa`
- `hle`
- `gsm8k`

## Resume And Force

Default reruns reuse exact-match local outputs:

- matching completed `review.json`: skips collection downloads, model checks,
and duplicate audit;
- matching pre-audit `manifest.json`: skips collection downloads and model
checks, then resumes at duplicate audit.

The cache must match collection name, datastore input, and HF-check mode.
Invalid exact-match cache files are hard errors. Use `--force` when you want to
ignore the cache and rebuild from the datastore.
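
The resume rules above can be sketched as a small decision function. Everything here is hypothetical: the `CacheKey` fields follow the three match criteria listed above, while the names `resume_stage`, `hf_check_mode`, and the returned stage labels are invented for illustration. The sketch does not model the hard error raised for invalid cache files.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CacheKey:
    collection: str
    datastore: str
    hf_check_mode: str  # hypothetical label for the HF-check mode


def resume_stage(
    current: CacheKey,
    review_key: Optional[CacheKey],
    manifest_key: Optional[CacheKey],
    force: bool = False,
) -> str:
    """Pick where a rerun starts (illustrative only)."""
    if force:
        return "rebuild"  # --force ignores any cache
    if review_key == current:
        return "reuse_review"  # skip downloads, model checks, and audit
    if manifest_key == current:
        return "resume_at_audit"  # skip downloads and model checks
    return "rebuild"


key = CacheKey("MMLU-Pro", "evaleval/EEE_datastore@main", "online")
print(resume_stage(key, review_key=None, manifest_key=key))
```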

## TUI

The final report has:

- `Community Evals Converter`: summary counts.
- `Needs Attention`: capped triage table for blockers and exclusions.

`Needs Attention` uses:

- `Issue`: `audit_error`, `score_conflict`, `already_present`,
`missing_hf_model`, or `skipped`.
- `Model`: model repo or aggregate model id.
- `Details`: reason or score comparison.
- `Action`: `exclude`, `block entry`, `block all`, or source line.
- `Where`: terminal hyperlink to the HF model PR/file or HF datastore blob URL.

Repeated same-score `already_present` findings are summarized as one count row.
Full details remain in `review.json`.

## PR Submission

The tool only opens PRs after both prompts succeed:

1. Type exactly:

```text
OPEN PRS
```

2. Enter a non-empty commit message.

Only `status = ready` entries are submitted.

Excluded statuses:

- `already_present`: same score already exists.
- `score_conflict`: different score already exists.
- `missing_hf_model`: model repo does not resolve on HF.
- `audit_error`: candidate-scoped audit failure.

Candidate-scoped audit errors block only that candidate. Audit errors without a
manifest entry block all PR submission.
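
The submission gate can be summarized in a short sketch. It is a sketch under stated assumptions rather than the tool's implementation: `entries_to_submit` and `unscoped_audit_errors` are hypothetical names, entries are modeled as plain dicts with a `status` key, and treating a whitespace-only commit message as empty is an assumption.

```python
def entries_to_submit(
    entries: list[dict],
    confirmation: str,
    commit_message: str,
    unscoped_audit_errors: int = 0,
) -> list[dict]:
    """Return the entries that would be submitted as PRs (illustrative only)."""
    # Audit errors without a manifest entry block all PR submission.
    if unscoped_audit_errors:
        return []
    # The confirmation must be typed exactly, so compare without stripping.
    if confirmation != "OPEN PRS":
        return []
    # Assumption: a whitespace-only commit message counts as empty.
    if not commit_message.strip():
        return []
    # Only ready entries go out; every other status is excluded.
    return [entry for entry in entries if entry.get("status") == "ready"]


candidates = [
    {"model": "owner/model-a", "status": "ready"},
    {"model": "owner/model-b", "status": "score_conflict"},
]
print(entries_to_submit(candidates, "OPEN PRS", "Add MMLU-Pro results"))
```

A candidate-scoped `audit_error` simply fails the `ready` filter, which is why it blocks only that candidate.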
1 change: 1 addition & 0 deletions pyproject.toml
@@ -20,6 +20,7 @@ dependencies = [
"numpy>=2.4.1",
"pandas>=2.3.3",
"pydantic>=2.12.5,<3.0.0",
"pyyaml>=6.0.3",
"requests>=2.32.5,<3.0.0",
"rich>=14.0.0,<15.0.0",
"seaborn>=0.13.2",
1 change: 1 addition & 0 deletions tests/data/community_evals_converter/aggregate.jsonl
@@ -0,0 +1 @@
{"benchmark":"MMLU-Pro","eval_schema_version":"0.2.2","legacy_path":"data/MMLU-Pro/01-ai/yi-1.5-34b-chat/676f4465-ce78-411a-9f5a-c97b3d2eac4f.json","object_path":"flat/objects/67/6f/676f4465-ce78-411a-9f5a-c97b3d2eac4f.json","object_uuid":"676f4465-ce78-411a-9f5a-c97b3d2eac4f","record_type":"aggregate","sha256":"a9cc2e4399f182f2e8d1a6198248e124ceafee6f70cbf5ddf31e76d1e74e6f94","size_bytes":23648}