
@stefanoamorelli stefanoamorelli commented Nov 29, 2025

When viewing eval results, if response_match_score failed but tool_trajectory_avg_score passed, every message in the invocation (including tool calls) incorrectly showed ❌. This is misleading, because the tool trajectory was actually correct.

To address this issue, this PR introduces an isToolRelatedEvent() helper to identify events involving tool calls. The addEvalCaseResultToEvents() method now assigns the metric based on event type:

  • Tool events → tool_trajectory_avg_score
  • Text responses → response_match_score

This solution hardcodes the mapping above. It works for the two current default metrics, but it does not automatically support custom or future metrics.
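For illustration, the hardcoded mapping could look roughly like the sketch below. This is not the code from the PR: the `AgentEvent`, `EventPart`, and `MetricResults` shapes and the `metricForEvent()` / `statusForEvent()` helpers are hypothetical, and only `isToolRelatedEvent()` and the two metric names come from the description above.

```typescript
// Hypothetical, simplified event shape; the real event type in the UI may differ.
interface EventPart {
  text?: string;
  functionCall?: { name: string; args?: Record<string, unknown> };
  functionResponse?: { name: string; response?: unknown };
}

interface AgentEvent {
  content?: { parts?: EventPart[] };
}

type EvalStatus = 'PASSED' | 'FAILED';

// Assumed shape: per-metric results keyed by metric name.
type MetricResults = Record<string, EvalStatus | undefined>;

// An event counts as tool-related if any part carries a function call or a function response.
function isToolRelatedEvent(event: AgentEvent): boolean {
  const parts = event.content?.parts ?? [];
  return parts.some(
    (p) => p.functionCall !== undefined || p.functionResponse !== undefined,
  );
}

// Hardcoded mapping: tool events use the trajectory metric, everything else the response metric.
function metricForEvent(event: AgentEvent): string {
  return isToolRelatedEvent(event)
    ? 'tool_trajectory_avg_score'
    : 'response_match_score';
}

// Looks up the status to display for one event; undefined if that metric was not reported.
function statusForEvent(
  event: AgentEvent,
  results: MetricResults,
): EvalStatus | undefined {
  return results[metricForEvent(event)];
}
```

With results of `{ tool_trajectory_avg_score: 'PASSED', response_match_score: 'FAILED' }`, a function-call event would resolve to `'PASSED'` and a plain text response to `'FAILED'`, which is the per-message behavior this PR is after.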

Fixes #187 with minimal, frontend-only changes. Long term, I would recommend a backend API change for a more scalable solution, such as including metadata on each metric that indicates which event types it evaluates (for example, something along the lines of appliesTo: 'tool' | 'response').
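As a rough sketch of how the frontend might consume such metadata if the backend provided it: the `MetricResult` shape, its `passed` field, and the `statusFor()` helper below are all assumptions for illustration, and `appliesTo` is only the proposed field, not something the current API returns.

```typescript
// Proposed (not existing) metadata: which kind of event a metric evaluates.
type AppliesTo = 'tool' | 'response';

// Hypothetical per-metric result as it might be returned by the backend.
interface MetricResult {
  metricName: string;
  passed: boolean;
  appliesTo: AppliesTo; // hypothetical field, see the suggestion above
}

// An event kind passes only if every metric that applies to it passed.
// Returns undefined when no metric targets that kind, so the UI can show no badge.
function statusFor(
  kind: AppliesTo,
  metrics: MetricResult[],
): boolean | undefined {
  const relevant = metrics.filter((m) => m.appliesTo === kind);
  if (relevant.length === 0) {
    return undefined;
  }
  return relevant.every((m) => m.passed);
}
```

Under this design, a custom metric would only need to declare its `appliesTo` value, and the frontend would no longer need to know metric names.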

@stefanoamorelli stefanoamorelli marked this pull request as draft November 29, 2025 18:16
@stefanoamorelli stefanoamorelli changed the title from "fix: display eval status per metric type instead of overall status" to "[draft] fix: display eval status per metric type" Nov 29, 2025
Previously, when viewing eval results, all messages in an invocation
showed the same pass/fail status. If response_match_score failed,
tool calls would incorrectly show ❌ even when tool_trajectory_avg_score
passed.

Now, tool-related events (functionCall, functionResponse) display
the tool_trajectory_avg_score result, while text responses display
the response_match_score result. This gives accurate per-metric
feedback in the eval UI.

Fixes google#187
@stefanoamorelli stefanoamorelli force-pushed the fix/eval-metric-display-per-message-type branch from 1a0a1bb to 4a8d456 November 29, 2025 18:22
@stefanoamorelli stefanoamorelli changed the title from "[draft] fix: display eval status per metric type" to "fix: display eval status per metric type" Nov 30, 2025
@stefanoamorelli stefanoamorelli marked this pull request as ready for review November 30, 2025 14:58
Development

Successfully merging this pull request may close these issues.

Eval bug: All metrics appear as failed for an eval case if any fail
