ci(evals): add GitHub Check Run reporting for evaluation results #640
Conversation
.github/scripts/create-eval-check.js
```javascript
        `### Conclusion`,
        ``,
        conclusion === "success"
          ? `✅ **Passed**: Average score (${avgScore.toFixed(2)}) is above the catastrophic failure threshold (0.50)`
```
Bug: Incorrect threshold message for passing score
The success message says the score "is above" the threshold, but the actual check uses `>=`, which includes scores equal to 0.50. When the average score is exactly 0.50, the check passes but displays "Average score (0.50) is above the catastrophic failure threshold (0.50)", which is mathematically incorrect. The message should say "is at or above" to match the `>=` comparison.
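One way to keep the message and the check from drifting apart is to derive both from a single comparison. A minimal sketch (the `buildConclusion` helper and `THRESHOLD` constant are hypothetical, not from the PR):

```javascript
const THRESHOLD = 0.5;

// Derive the check conclusion and its summary text from the same
// comparison, so the wording can never contradict the actual check.
function buildConclusion(avgScore) {
  const passed = avgScore >= THRESHOLD; // single source of truth
  return {
    conclusion: passed ? "success" : "failure",
    summary: passed
      ? `✅ **Passed**: Average score (${avgScore.toFixed(2)}) is at or above the catastrophic failure threshold (${THRESHOLD.toFixed(2)})`
      : `❌ **Failed**: Average score (${avgScore.toFixed(2)}) is below the catastrophic failure threshold (${THRESHOLD.toFixed(2)})`,
  };
}
```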
```yaml
        if: ${{ !cancelled() }}
        uses: codecov/test-results-action@v1
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
```
```javascript
      evalResults.push({
        name: test.fullName || test.title,
        file: testFile.name,
        avgScore: test.meta.eval.avgScore ?? null,
        scores: test.meta.eval.scores || [],
        passed: test.status === 'passed',
        duration: test.duration,
      });
    }
  }
}
```
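Once the results are collected in this shape, the overall statistics and the green/yellow/red distribution the PR describes can be computed in one pass. A sketch under the assumption that `avgScore` is `null` for tests without eval metadata (the `summarize` helper is hypothetical):

```javascript
// Aggregate collected eval results into the overall average score and
// the distribution by category (>= 0.75 green, 0.50-0.74 yellow, < 0.50 red).
function summarize(evalResults) {
  const scored = evalResults.filter((r) => r.avgScore !== null);
  const avgScore =
    scored.reduce((sum, r) => sum + r.avgScore, 0) / (scored.length || 1);
  return {
    avgScore,
    distribution: {
      green: scored.filter((r) => r.avgScore >= 0.75).length,
      yellow: scored.filter((r) => r.avgScore >= 0.5 && r.avgScore < 0.75).length,
      red: scored.filter((r) => r.avgScore < 0.5).length,
    },
  };
}
```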
Create automated check runs that report eval scores and statistics:

- Add create-eval-check.js script to parse eval-results.json and create GitHub Check Runs
- Update eval:ci to output JSON results via --reporter=json
- Set check conclusion to 'success' if avg score >= 0.5, 'failure' otherwise
- Display overall statistics and score distribution (green/yellow/red)
- Run check creation even if evals fail (continue-on-error + !cancelled)
- Add checks:write permission to eval workflow
- Ignore generated eval-results.json in version control

The check run provides:

- Overall average score with pass/fail threshold (0.50)
- Score distribution by category (>=0.75, 0.50-0.74, <0.50)
- Individual eval scores sorted by performance
- Rationale for failed or low-scoring tests

Co-Authored-By: Claude Code <[email protected]>
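For context on how a script like create-eval-check.js can publish these results, GitHub's Checks API accepts a single `checks.create` call with a conclusion and a Markdown summary. A sketch assuming `@octokit/rest` is used and the token has `checks:write`; the `buildCheckPayload` helper and its parameter names are hypothetical:

```javascript
// Build the payload for a GitHub Check Run reporting eval results.
// Conclusion follows the PR's rule: success iff avg score >= 0.5.
function buildCheckPayload({ owner, repo, headSha, avgScore, summaryMarkdown }) {
  return {
    owner,
    repo,
    name: "Evaluation Results",
    head_sha: headSha,
    status: "completed",
    conclusion: avgScore >= 0.5 ? "success" : "failure",
    output: {
      title: `Average eval score: ${avgScore.toFixed(2)}`,
      summary: summaryMarkdown,
    },
  };
}

// Usage (network call, requires a token with checks:write):
// const { Octokit } = require("@octokit/rest");
// const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
// await octokit.rest.checks.create(buildCheckPayload({ ... }));
```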