Conversation

@dcramer dcramer commented Nov 20, 2025

Create automated check runs that report eval scores and statistics:

  • Add create-eval-check.js script to parse eval-results.json and create GitHub Check Runs
  • Update eval:ci to output JSON results via --reporter=json
  • Set check conclusion to 'success' if avg score >= 0.5, 'failure' otherwise
  • Display overall statistics and score distribution (green/yellow/red)
  • Run check creation even if evals fail (continue-on-error + !cancelled; see the workflow sketch after this list)
  • Add checks:write permission to eval workflow
  • Ignore generated eval-results.json in version control
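
The pieces above might be wired together in the workflow roughly as follows. This is a sketch, not the merged workflow: the job name, package manager, and script path are assumptions; only checks:write, continue-on-error, and !cancelled() are confirmed by the description.

permissions:
  checks: write            # required for the script to create Check Runs

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run evals
        run: pnpm eval:ci                 # assumed to write eval-results.json via --reporter=json
        continue-on-error: true           # do not fail the job before the check is created
      - name: Create eval check run
        if: ${{ !cancelled() }}           # still runs when the eval step failed
        run: node scripts/create-eval-check.js   # hypothetical path
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}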

The check run provides:

  • Overall average score with pass/fail threshold (0.50); a sketch of the script's core logic follows this list
  • Score distribution by category (>=0.75, 0.50-0.74, <0.50)
  • Individual eval scores sorted by performance
  • Rationale for failed or low-scoring tests
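
A minimal sketch of the script's core, assuming for brevity that the per-test entries have already been flattened into an array of { name, avgScore } objects (the real script derives them from the JSON reporter output, as shown in the diff excerpt further down). The check name, summary text, and file shape are illustrative; checks.create is the real Octokit REST endpoint behind GitHub Check Runs.

// create-eval-check.js (sketch, not the merged implementation)
import { readFileSync } from 'node:fs';
import { Octokit } from '@octokit/rest';

// Assumption: eval-results.json has been pre-flattened to [{ name, avgScore }, ...]
const evalResults = JSON.parse(readFileSync('eval-results.json', 'utf8'));

const scored = evalResults.filter((r) => r.avgScore !== null);
const avgScore =
  scored.reduce((sum, r) => sum + r.avgScore, 0) / Math.max(scored.length, 1);

// 'success' iff the average is at or above 0.50 (note: >=, not >)
const conclusion = avgScore >= 0.5 ? 'success' : 'failure';

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
const [owner, repo] = process.env.GITHUB_REPOSITORY.split('/');

// Creating a Check Run is what requires the checks:write permission
// added to the workflow.
await octokit.rest.checks.create({
  owner,
  repo,
  name: 'Eval Results',
  head_sha: process.env.GITHUB_SHA,
  status: 'completed',
  conclusion,
  output: {
    title: `Average score: ${avgScore.toFixed(2)}`,
    summary: `${scored.length} scored evals, average ${avgScore.toFixed(2)} (threshold 0.50)`,
  },
});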

`### Conclusion`,
``,
conclusion === "success"
? `✅ **Passed**: Average score (${avgScore.toFixed(2)}) is above the catastrophic failure threshold (0.50)`
Bug: Incorrect threshold message for passing score

The success message says the score "is above" the threshold, but the actual check uses >=, which includes a score exactly equal to 0.50. When the average score is exactly 0.50, the check passes yet displays "Average score (0.50) is above the catastrophic failure threshold (0.50)", which is incorrect. The message should say "is at or above" to match the >= comparison.
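
A fix along these lines changes only the wording so it matches the >= comparison (sketch of the corrected branch, surrounding code as quoted above):

conclusion === "success"
  ? `✅ **Passed**: Average score (${avgScore.toFixed(2)}) is at or above the catastrophic failure threshold (0.50)`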


if: ${{ !cancelled() }}
uses: codecov/test-results-action@v1
with:
  token: ${{ secrets.CODECOV_TOKEN }}

This comment was marked as outdated.

Comment on lines +94 to +104
      // Flatten the reporter output into one summary entry per eval test
      evalResults.push({
        name: test.fullName || test.title,
        file: testFile.name,
        avgScore: test.meta.eval.avgScore ?? null, // null when no score was recorded
        scores: test.meta.eval.scores || [],
        passed: test.status === 'passed',
        duration: test.duration,
      });
    }
  }
}
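
For context, the green/yellow/red distribution and the performance-sorted listing described in the PR summary could be derived from this array roughly as follows. A sketch only: the bucket labels, sort direction, and handling of unscored tests are assumptions.

// Bucket entries into the three categories from the description:
// green (>= 0.75), yellow (0.50-0.74), red (< 0.50).
const buckets = { green: [], yellow: [], red: [] };
for (const result of evalResults) {
  if (result.avgScore === null) continue; // unscored tests are skipped here
  if (result.avgScore >= 0.75) buckets.green.push(result);
  else if (result.avgScore >= 0.5) buckets.yellow.push(result);
  else buckets.red.push(result);
}

// Sort individual evals by score (assumed worst-first, so failures and
// low scorers, whose rationale is surfaced, appear at the top).
const sorted = [...evalResults].sort(
  (a, b) => (a.avgScore ?? 0) - (b.avgScore ?? 0),
);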

This comment was marked as outdated.

@dcramer dcramer merged commit d44a078 into main Nov 20, 2025
16 checks passed
@dcramer dcramer deleted the eval-action branch November 20, 2025 03:00
