Conversation

@dcramer dcramer commented Nov 20, 2025

Create automated check runs that report eval scores and statistics:

  • Add create-eval-check.js script to parse eval-results.json and create GitHub Check Runs
  • Update eval:ci to output JSON results via --reporter=json
  • Set check conclusion to 'success' if avg score >= 0.5, 'failure' otherwise
  • Display overall statistics and score distribution (green/yellow/red)
  • Run check creation even if evals fail (continue-on-error + !cancelled; see the workflow sketch after this list)
  • Add checks:write permission to eval workflow
  • Ignore generated eval-results.json in version control
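
The pieces above might be wired together in the workflow roughly as follows. This is a sketch, not the merged workflow: the job name, package manager, and script path are assumptions; only checks:write, continue-on-error, and !cancelled() are confirmed by the description.

permissions:
  checks: write            # required for the script to create Check Runs

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run evals
        run: pnpm eval:ci                 # assumed to write eval-results.json via --reporter=json
        continue-on-error: true           # do not fail the job before the check is created
      - name: Create eval check run
        if: ${{ !cancelled() }}           # still runs when the eval step failed
        run: node scripts/create-eval-check.js   # hypothetical path
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}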

The check run provides:

  • Overall average score with pass/fail threshold (0.50); a sketch of the script's core logic follows this list
  • Score distribution by category (>=0.75, 0.50-0.74, <0.50)
  • Individual eval scores sorted by performance
  • Rationale for failed or low-scoring tests
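
A minimal sketch of the script's core, assuming for brevity that the per-test entries have already been flattened into an array of { name, avgScore } objects (the real script derives them from the JSON reporter output, as shown in the diff excerpt further down). The check name, summary text, and file shape are illustrative; checks.create is the real Octokit REST endpoint behind GitHub Check Runs.

// create-eval-check.js (sketch, not the merged implementation)
import { readFileSync } from 'node:fs';
import { Octokit } from '@octokit/rest';

// Assumption: eval-results.json has been pre-flattened to [{ name, avgScore }, ...]
const evalResults = JSON.parse(readFileSync('eval-results.json', 'utf8'));

const scored = evalResults.filter((r) => r.avgScore !== null);
const avgScore =
  scored.reduce((sum, r) => sum + r.avgScore, 0) / Math.max(scored.length, 1);

// 'success' iff the average is at or above 0.50 (note: >=, not >)
const conclusion = avgScore >= 0.5 ? 'success' : 'failure';

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
const [owner, repo] = process.env.GITHUB_REPOSITORY.split('/');

// Creating a Check Run is what requires the checks:write permission
// added to the workflow.
await octokit.rest.checks.create({
  owner,
  repo,
  name: 'Eval Results',
  head_sha: process.env.GITHUB_SHA,
  status: 'completed',
  conclusion,
  output: {
    title: `Average score: ${avgScore.toFixed(2)}`,
    summary: `${scored.length} scored evals, average ${avgScore.toFixed(2)} (threshold 0.50)`,
  },
});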

`### Conclusion`,
``,
conclusion === "success"
? `✅ **Passed**: Average score (${avgScore.toFixed(2)}) is above the catastrophic failure threshold (0.50)`
Bug: Incorrect threshold message for passing score

The success message says the score "is above" the threshold, but the actual check uses >=, which includes a score exactly equal to 0.50. When the average score is exactly 0.50, the check passes yet displays "Average score (0.50) is above the catastrophic failure threshold (0.50)", which is incorrect. The message should say "is at or above" to match the >= comparison.
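
A fix along these lines changes only the wording so it matches the >= comparison (sketch of the corrected branch, surrounding code as quoted above):

conclusion === "success"
  ? `✅ **Passed**: Average score (${avgScore.toFixed(2)}) is at or above the catastrophic failure threshold (0.50)`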


if: ${{ !cancelled() }}
uses: codecov/test-results-action@v1
with:
  token: ${{ secrets.CODECOV_TOKEN }}

This comment was marked as outdated.

Comment on lines +94 to +104
      // Flatten the reporter output into one summary entry per eval test
      evalResults.push({
        name: test.fullName || test.title,
        file: testFile.name,
        avgScore: test.meta.eval.avgScore ?? null, // null when no score was recorded
        scores: test.meta.eval.scores || [],
        passed: test.status === 'passed',
        duration: test.duration,
      });
    }
  }
}
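
For context, the green/yellow/red distribution and the performance-sorted listing described in the PR summary could be derived from this array roughly as follows. A sketch only: the bucket labels, sort direction, and handling of unscored tests are assumptions.

// Bucket entries into the three categories from the description:
// green (>= 0.75), yellow (0.50-0.74), red (< 0.50).
const buckets = { green: [], yellow: [], red: [] };
for (const result of evalResults) {
  if (result.avgScore === null) continue; // unscored tests are skipped here
  if (result.avgScore >= 0.75) buckets.green.push(result);
  else if (result.avgScore >= 0.5) buckets.yellow.push(result);
  else buckets.red.push(result);
}

// Sort individual evals by score (assumed worst-first, so failures and
// low scorers, whose rationale is surfaced, appear at the top).
const sorted = [...evalResults].sort(
  (a, b) => (a.avgScore ?? 0) - (b.avgScore ?? 0),
);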

This comment was marked as outdated.

@dcramer dcramer merged commit d44a078 into main Nov 20, 2025
16 checks passed
@dcramer dcramer deleted the eval-action branch November 20, 2025 03:00
