evaleval · nelaturuharsha · Jun 13, 2026
diff --git a/docs/getting-started/index.md b/docs/getting-started/index.md
@@ -31,3 +31,4 @@ uv run python -m every_eval_ever --help
 - See [Data Structure](../data-structure/)
 - See [Eval Converters](../eval-converters/)
 - See [Contributing](../contributing/)
+- See [HF Community Evals](../hf-community-evals/)
diff --git a/docs/hf-community-evals/index.md b/docs/hf-community-evals/index.md
@@ -0,0 +1,132 @@
+---
+layout: default
+title: HF Community Evals
+nav_order: 6
+---
+
+# EEE -> HF Community Evals
+
+Built and maintained by Harsha Nelaturu · EvalEval Coalition · June 2026.
+
+Use `tools/hf-community-evals/community_evals_converter.py` to review one EEE datastore collection, generate
+local HF Community Evals YAML previews, audit existing scores/open PRs, and
+optionally open PRs after explicit approval.
+
+## Quick Start
+
+Use `uv run` for all commands.
+
+```bash
+uv run tools/hf-community-evals/community_evals_converter.py MMLU-Pro \
+  --datastore evaleval/EEE_datastore@main
+```
+
+Results for the collection are cached locally and reused on later runs. To ignore the cache and force a fresh rebuild, pass `--force`:
+
+```bash
+uv run tools/hf-community-evals/community_evals_converter.py MMLU-Pro \
+  --datastore evaleval/EEE_datastore@main \
+  --force
+```
+
+The positional argument is a collection stem. It must resolve exactly to:
+
+```text
+https://huggingface.co/datasets/evaleval/EEE_datastore/flat/indexes/by_collection/<collection>.jsonl
+```
+
+## Outputs
+
+For `MMLU-Pro`, outputs are written under:
+
+```text
+outputs/community_evals_converter_MMLU-Pro/
+```
+
+Important output files:
+
+- `manifest.json`: converted candidate records plus skipped/error metadata.
+- `review.json`: full review result, duplicate audit findings, audit errors,
+  and PR readiness.
+- `yamls/<owner>/<model>/.eval_results/<benchmark>.yaml`: local YAML previews.
+
+`outputs/` is ignored by git. Use these files for inspection, not as merge
+inputs.
+
+## Review Behavior
+
+The tool:
+
+- downloads the collection JSONL and referenced aggregate objects from the HF
+  datastore;
+- validates object hashes and optional sizes;
+- scans each aggregate record for supported HF benchmark datasets;
+- writes YAML entries using the datastore object HF URL as `source.url`;
+- keeps flat datastore provenance, including instance-level references when
+  present;
+- checks model repo existence on Hugging Face;
+- audits every existing `.eval_results/*.yaml` file on model `main`;
+- audits changed `.eval_results/*.yaml` files in open PR refs;
+- compares by dataset/task content, not YAML filename.
+
+Supported benchmarks in this workflow are:
+
+- `mmlu_pro`
+- `gpqa`
+- `hle`
+- `gsm8k`
+
+## Resume And Force
+
+Default reruns reuse exact-match local outputs:
+
+- matching completed `review.json`: skips collection downloads, model checks,
+  and duplicate audit;
+- matching pre-audit `manifest.json`: skips collection downloads and model
+  checks, then resumes at duplicate audit.
+
+The cache must match collection name, datastore input, and HF-check mode.
+Invalid exact-match cache files are hard errors. Use `--force` when you want to
+ignore the cache and rebuild from the datastore.
+
+## TUI
+The final report has:
+
+- `Community Evals Converter`: summary counts.
+- `Needs Attention`: capped triage table for blockers and exclusions.
+
+`Needs Attention` uses:
+
+- `Issue`: `audit_error`, `score_conflict`, `already_present`,
+  `missing_hf_model`, or `skipped`.
+- `Model`: model repo or aggregate model id.
+- `Details`: reason or score comparison.
+- `Action`: `exclude`, `block entry`, `block all`, or source line.
+- `Where`: terminal hyperlink to the HF model PR/file or HF datastore blob URL.
+
+Repeated same-score `already_present` findings are summarized as one count row.
+Full details remain in `review.json`.
+
+## PR Submission
+
+The tool only opens PRs after both prompts succeed:
+
+1. Type exactly:
+
+   ```text
+   OPEN PRS
+   ```
+
+2. Enter a non-empty commit message.
+
+Only `status = ready` entries are submitted.
+
+Excluded statuses:
+
+- `already_present`: same score already exists.
+- `score_conflict`: different score already exists.
+- `missing_hf_model`: model repo does not resolve on HF.
+- `audit_error`: candidate-scoped audit failure.
+
+Candidate-scoped audit errors block only that candidate. Audit errors without a
+manifest entry block all PR submission.
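Reviewer note: the submission rules in the PR Submission section can be sketched as below. This is a minimal illustration, not the converter's actual code. The record shapes are assumptions: manifest entries are taken to be dicts with a `status` key, and audit errors are taken to be dicts whose hypothetical `entry_id` key is missing or `None` when the error is not tied to a manifest entry. Treating a whitespace-only commit message as empty is also an assumption.

```python
from typing import Any

READY = "ready"


def select_submittable(
    entries: list[dict[str, Any]],
    audit_errors: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Return the entries that may be submitted as PRs.

    Assumed shapes (hypothetical, not taken from the tool):
    - each entry has a "status" string;
    - each audit error has "entry_id" set when it is candidate-scoped,
      and missing or None when it has no manifest entry.
    """
    # An audit error without a manifest entry blocks all PR submission.
    if any(err.get("entry_id") is None for err in audit_errors):
        return []
    # Otherwise only ready entries are submitted; already_present,
    # score_conflict, missing_hf_model, and audit_error are excluded.
    return [entry for entry in entries if entry.get("status") == READY]


def confirmed(typed_phrase: str, commit_message: str) -> bool:
    """Both prompts must succeed: the exact phrase and a non-empty message."""
    return typed_phrase == "OPEN PRS" and bool(commit_message.strip())
```

Keeping the global block separate from the per-entry filter mirrors the distinction the doc draws between candidate-scoped audit errors and audit errors without a manifest entry.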
diff --git a/pyproject.toml b/pyproject.toml
@@ -20,6 +20,7 @@ dependencies = [
     "numpy>=2.4.1",
     "pandas>=2.3.3",
     "pydantic>=2.12.5,<3.0.0",
+    "pyyaml>=6.0.3",
     "requests>=2.32.5,<3.0.0",
     "rich>=14.0.0,<15.0.0",
     "seaborn>=0.13.2",

diff --git a/tests/data/community_evals_converter/aggregate.jsonl b/tests/data/community_evals_converter/aggregate.jsonl
@@ -0,0 +1 @@
+{"benchmark":"MMLU-Pro","eval_schema_version":"0.2.2","legacy_path":"data/MMLU-Pro/01-ai/yi-1.5-34b-chat/676f4465-ce78-411a-9f5a-c97b3d2eac4f.json","object_path":"flat/objects/67/6f/676f4465-ce78-411a-9f5a-c97b3d2eac4f.json","object_uuid":"676f4465-ce78-411a-9f5a-c97b3d2eac4f","record_type":"aggregate","sha256":"a9cc2e4399f182f2e8d1a6198248e124ceafee6f70cbf5ddf31e76d1e74e6f94","size_bytes":23648}
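Reviewer note: the fixture record above carries the `sha256` and `size_bytes` fields that the docs say are validated ("validates object hashes and optional sizes"). A minimal sketch of such a check follows, assuming the downloaded object is available as raw bytes. The function name `verify_object` is hypothetical and the converter's real implementation may differ.

```python
import hashlib
from typing import Any


def verify_object(payload: bytes, record: dict[str, Any]) -> None:
    """Raise ValueError if payload does not match its index record.

    `record` is one parsed line of the collection JSONL, for example the
    fixture line above. `sha256` is required; `size_bytes` is treated as
    optional and is only checked when present.
    """
    digest = hashlib.sha256(payload).hexdigest()
    if digest != record["sha256"]:
        raise ValueError(
            f"sha256 mismatch for {record.get('object_path')}: "
            f"expected {record['sha256']}, got {digest}"
        )
    expected_size = record.get("size_bytes")
    if expected_size is not None and len(payload) != expected_size:
        raise ValueError(
            f"size mismatch for {record.get('object_path')}: "
            f"expected {expected_size}, got {len(payload)}"
        )
```

A caller would run this on each downloaded aggregate object before scanning it, so a corrupted or truncated download fails early instead of producing a wrong YAML preview.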