fix(eval): empty expected.args no longer vacuously passes in args scorer by ajay-kesavan · Pull Request #1735 · UiPath/uipath-python

ajay-kesavan · 2026-06-18T17:53:33Z

Summary

`tool_calls_args_score` iterated over `expected.args` and any tool call whose expected.args was `{}` vacuously passed because `all([])` is `True` in Python.
Result: an evaluator configured with `{ "name": "X", "args": {} }` scored 1.0 regardless of the actual call's arguments. Both `subset=True` and `subset=False` modes hit this — the loop just doesn't execute when expected.args is empty.
This is the most common eval-authoring failure mode: users hit "add tool" in the picker, name the tool, never fill in args, and silently get a green check on runs that aren't validating anything.
Add a guard: empty expected.args + non-empty actual.args = mismatch in either mode. Empty + empty still matches.

Why this matters

Caught during end-to-end testing of #1733 (sanitised-name match). With #1733, the sanitiser correctly resolved `Web Search` (display) ↔ `Web_Search` (sanitised runtime) and the args evaluator then "matched" with score 1.0 — but the score was meaningless because expected.args was `{}`. The user's authoring path through the Studio Web evaluator picker produced this shape by default.

Behavior matrix

expected.args	actual.args	Before	After	Notes
`{}`	`{}`	1.0	1.0	regression check
`{}`	`{a: 1}`	1.0	0.0	the bug being fixed
`{a: 1}`	`{a: 1}`	1.0	1.0	unchanged
`{a: 1}`	`{a: 1, b: 2}`	1.0	1.0	subset/exact unchanged on non-empty expected
`{a: 1, b: 2}`	`{a: 1}`	0.0	0.0	KeyError → False, unchanged

The change is targeted at the empty-expected case only. Non-empty expected paths are not touched — `subset=True` keeps subset semantics, `subset=False` keeps its KeyError-on-missing semantics.

Test plan

`pytest tests/evaluators/test_evaluator_helpers.py::TestToolCallsArgsScore` — 11 passed (8 existing + 3 new)
`pytest tests/cli/eval/ tests/evaluators/` — 873 passed, full eval surface unchanged
`ruff check` / `ruff format --check` / `mypy` on touched files — all green

Risk / migration

Any existing eval-set that was unintentionally using `args: {}` as a placeholder will start scoring 0 instead of 1.0 — that's the intended outcome (the green check was hiding the fact that the criterion wasn't validating anything). No code-level migration needed; users authoring fresh eval sets should fill in real expected args.

Independent of, and complementary to, #1733.

🤖 Generated with Claude Code

tool_calls_args_score iterated over expected.args via: all(args_check(k, v) for k, v in expected_tool_call.args.items()) When expected.args was {}, the generator iterated zero times and all([]) returned True, causing the per-call args check to vacuously pass regardless of what the actual tool call passed. The most common authoring failure surfaced from this: an evaluator configured with toolCalls: [{ "name": "X", "args": {} }] would always score 1.0 even when the agent passed substantial real arguments, masking what the user intended as "real validation". Both subset=True and subset=False modes hit this — the loop just doesn't execute. Add an explicit guard: empty expected.args with non-empty actual.args is a mismatch in either mode. Empty expected with empty actual still matches (1.0). Non-empty expected paths are unchanged. Three new tests cover the guard: * empty expected + non-empty actual under subset=False → 0.0 * empty expected + non-empty actual under subset=True → 0.0 * empty expected + empty actual → 1.0 (regression check) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

Fixes a correctness bug in the tool-call arguments scorer where an empty expected.args would previously “match” any actual args due to all([]) == True, causing unintended 1.0 scores for effectively non-validating criteria. This fits into the eval framework’s tool-call evaluators by making argument validation meaningful even when criteria are authored with default/empty args.

Changes:

Add an explicit guard in tool_calls_args_score so expected.args == {} only matches when actual.args == {}, and fails when actual args are non-empty.
Add targeted regression tests covering empty-expected vs non-empty-actual in both subset and non-subset modes, plus the empty/empty pass case.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
packages/uipath/src/uipath/eval/_helpers/evaluators_helpers.py	Adds a guard to prevent vacuous `all([])` passing when `expected.args` is empty but actual args are present.
packages/uipath/tests/evaluators/test_evaluator_helpers.py	Adds regression tests validating the new empty-expected behavior (fail on non-empty actual; pass on empty actual).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

sonarqubecloud · 2026-06-18T17:56:25Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
100.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Copilot AI review requested due to automatic review settings June 18, 2026 17:53

github-actions Bot added test:uipath-langchain Triggers tests in the uipath-langchain-python repository test:uipath-integrations labels Jun 18, 2026

Copilot started reviewing on behalf of ajay-kesavan June 18, 2026 17:54 View session

Copilot AI reviewed Jun 18, 2026

View reviewed changes

ajay-kesavan closed this Jun 18, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(eval): empty expected.args no longer vacuously passes in args scorer#1735

fix(eval): empty expected.args no longer vacuously passes in args scorer#1735
ajay-kesavan wants to merge 1 commit into
mainfrom
ajay/eval-args-empty-expected-fix

ajay-kesavan commented Jun 18, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

sonarqubecloud Bot commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ajay-kesavan commented Jun 18, 2026

Summary

Why this matters

Behavior matrix

Test plan

Risk / migration

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

sonarqubecloud Bot commented Jun 18, 2026

Quality Gate passed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants