fix: display eval status per metric type #305
Open
When viewing eval results, if `response_match_score` failed but `tool_trajectory_avg_score` passed, every message in the invocation (including tool calls) incorrectly showed ❌. This is confusing because the tool trajectory is actually correct.

To address this, this PR introduces an `isToolRelatedEvent()` helper to identify events that involve tool calls. The `addEvalCaseResultToEvents()` method now assigns a metric based on the event type: tool-related events are judged by `tool_trajectory_avg_score`, while all other events are judged by `response_match_score`.

This solution hardcodes the mapping above. It works for the two current default metrics, but it does not automatically support custom or future metrics.
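As a rough sketch of the approach (the event field names below are assumptions for illustration, not the actual ADK web event shape):

```typescript
// Hypothetical event shape — real ADK web events differ; only the
// helper names (isToolRelatedEvent, the metric keys) come from this PR.
interface EvalEvent {
  functionCall?: object;
  functionResponse?: object;
  text?: string;
}

// True for events that involve a tool call or a tool response.
function isToolRelatedEvent(event: EvalEvent): boolean {
  return event.functionCall !== undefined || event.functionResponse !== undefined;
}

// Picks which metric should drive an event's pass/fail badge.
function metricForEvent(event: EvalEvent): string {
  return isToolRelatedEvent(event)
    ? 'tool_trajectory_avg_score'
    : 'response_match_score';
}
```

With this, a passing `tool_trajectory_avg_score` keeps tool-call events marked ✅ even when `response_match_score` fails for the final response.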
Fixes #187 with minimal frontend-only changes. Long-term, I would recommend a backend API change for a more scalable solution, such as including metadata on each metric indicating which event types it evaluates (for example, something along the lines of `appliesTo: 'tool' | 'response'`).