fix: display eval status per metric type #305

stefanoamorelli · 2025-11-29T18:15:54Z

When viewing eval results, if response_match_score failed but tool_trajectory_avg_score passed, all messages in the invocation (including tool calls) incorrectly showed ❌. This is misleading, because the tool trajectory was actually correct.

To address this issue, this PR introduces an isToolRelatedEvent() helper to identify events involving tool calls. The addEvalCaseResultToEvents() method now assigns the metric based on event type:

Tool events → tool_trajectory_avg_score
Text responses → response_match_score

This solution hardcodes the mapping above. It works for the two current default metrics, but it does not automatically support custom or future metrics.
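For reviewers, here is a minimal sketch of the hardcoded mapping, not the exact code in this PR. The `UiEvent`, `EventPart`, and `MetricStatuses` shapes and the `statusForEvent` helper are hypothetical simplifications; only `isToolRelatedEvent`, the two metric names, and the `functionCall` / `functionResponse` fields come from the description.

```typescript
// Hypothetical, simplified shapes; the real event and eval-result types may differ.
interface EventPart {
  text?: string;
  functionCall?: unknown;
  functionResponse?: unknown;
}

interface UiEvent {
  parts: EventPart[];
}

type EvalStatus = 'passed' | 'failed';

// Metric name -> status for one invocation (assumed shape).
type MetricStatuses = Record<string, EvalStatus | undefined>;

const TOOL_METRIC = 'tool_trajectory_avg_score';
const RESPONSE_METRIC = 'response_match_score';

// An event counts as tool-related if any part carries a functionCall or functionResponse.
function isToolRelatedEvent(event: UiEvent): boolean {
  return event.parts.some(
    (part) => part.functionCall !== undefined || part.functionResponse !== undefined,
  );
}

// Pick the status to show next to an event using the hardcoded mapping.
// Hypothetical helper; the PR does this inside addEvalCaseResultToEvents().
function statusForEvent(event: UiEvent, statuses: MetricStatuses): EvalStatus | undefined {
  const metric = isToolRelatedEvent(event) ? TOOL_METRIC : RESPONSE_METRIC;
  return statuses[metric];
}
```

With this mapping, a tool call in an invocation where only response_match_score failed would look up tool_trajectory_avg_score and show its own status instead of inheriting the failure.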

This fixes #187 with minimal, frontend-only changes. Long-term, I would recommend a backend API change for a more scalable solution, such as including metadata on each metric that indicates which event types it evaluates (for example, something along the lines of appliesTo: 'tool' | 'response').
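To make the long-term suggestion concrete, here is a rough sketch of how the frontend could consume such metadata. Everything in it is hypothetical: the `appliesTo` field does not exist in the current API, and `MetricResult`, `EventKind`, and `passedForKind` are illustrative names only.

```typescript
// Hypothetical API shape: `appliesTo` is a proposed field, not part of the current backend response.
type EventKind = 'tool' | 'response';

interface MetricResult {
  metricName: string;
  passed: boolean;
  appliesTo: EventKind;
}

// Combine every metric that targets the given event kind.
// Returns undefined when no metric applies, so the UI can show no badge at all.
function passedForKind(results: MetricResult[], kind: EventKind): boolean | undefined {
  const relevant = results.filter((result) => result.appliesTo === kind);
  if (relevant.length === 0) {
    return undefined;
  }
  return relevant.every((result) => result.passed);
}
```

Under this assumption, a custom metric would only need to declare which event kind it evaluates, and the frontend would no longer need to know metric names.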

Previously, when viewing eval results, all messages in an invocation showed the same pass/fail status. If response_match_score failed, tool calls would incorrectly show ❌ even when tool_trajectory_avg_score passed. Now, tool-related events (functionCall, functionResponse) display the tool_trajectory_avg_score result, while text responses display the response_match_score result. This gives accurate per-metric feedback in the eval UI. Fixes google#187

stefanoamorelli marked this pull request as draft November 29, 2025 18:16

stefanoamorelli changed the title ~~fix: display eval status per metric type instead of overall status~~ [draft] fix: display eval status per metric type Nov 29, 2025

stefanoamorelli force-pushed the fix/eval-metric-display-per-message-type branch from 1a0a1bb to 4a8d456 Compare November 29, 2025 18:22

stefanoamorelli mentioned this pull request Nov 30, 2025

Eval bug: All metrics appear as failed for an eval case if any fail #187

Open

stefanoamorelli changed the title ~~[draft] fix: display eval status per metric type~~ fix: display eval status per metric type Nov 30, 2025

stefanoamorelli marked this pull request as ready for review November 30, 2025 14:58
