fix(eval): empty expected.args no longer vacuously passes in args scorer#1735
Closed
ajay-kesavan wants to merge 1 commit into
Closed
fix(eval): empty expected.args no longer vacuously passes in args scorer#1735ajay-kesavan wants to merge 1 commit into
ajay-kesavan wants to merge 1 commit into
Conversation
tool_calls_args_score iterated over expected.args via:
all(args_check(k, v) for k, v in expected_tool_call.args.items())
When expected.args was {}, the generator iterated zero times and
all([]) returned True, causing the per-call args check to vacuously
pass regardless of what the actual tool call passed.
The most common authoring failure surfaced from this: an evaluator
configured with toolCalls: [{ "name": "X", "args": {} }] would always
score 1.0 even when the agent passed substantial real arguments,
masking what the user intended as "real validation". Both subset=True
and subset=False modes hit this — the loop just doesn't execute.
Add an explicit guard: empty expected.args with non-empty actual.args
is a mismatch in either mode. Empty expected with empty actual still
matches (1.0). Non-empty expected paths are unchanged.
Three new tests cover the guard:
* empty expected + non-empty actual under subset=False → 0.0
* empty expected + non-empty actual under subset=True → 0.0
* empty expected + empty actual → 1.0 (regression check)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Fixes a correctness bug in the tool-call arguments scorer where an empty expected.args would previously “match” any actual args due to all([]) == True, causing unintended 1.0 scores for effectively non-validating criteria. This fits into the eval framework’s tool-call evaluators by making argument validation meaningful even when criteria are authored with default/empty args.
Changes:
- Add an explicit guard in
tool_calls_args_scoresoexpected.args == {}only matches whenactual.args == {}, and fails when actual args are non-empty. - Add targeted regression tests covering empty-expected vs non-empty-actual in both subset and non-subset modes, plus the empty/empty pass case.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| packages/uipath/src/uipath/eval/_helpers/evaluators_helpers.py | Adds a guard to prevent vacuous all([]) passing when expected.args is empty but actual args are present. |
| packages/uipath/tests/evaluators/test_evaluator_helpers.py | Adds regression tests validating the new empty-expected behavior (fail on non-empty actual; pass on empty actual). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.



Summary
Why this matters
Caught during end-to-end testing of #1733 (sanitised-name match). With #1733, the sanitiser correctly resolved `Web Search` (display) ↔ `Web_Search` (sanitised runtime) and the args evaluator then "matched" with score 1.0 — but the score was meaningless because expected.args was `{}`. The user's authoring path through the Studio Web evaluator picker produced this shape by default.
Behavior matrix
The change is targeted at the empty-expected case only. Non-empty expected paths are not touched — `subset=True` keeps subset semantics, `subset=False` keeps its KeyError-on-missing semantics.
Test plan
Risk / migration
Any existing eval-set that was unintentionally using `args: {}` as a placeholder will start scoring 0 instead of 1.0 — that's the intended outcome (the green check was hiding the fact that the criterion wasn't validating anything). No code-level migration needed; users authoring fresh eval sets should fill in real expected args.
Independent of, and complementary to, #1733.
🤖 Generated with Claude Code