Skip to content

fix(eval): empty expected.args no longer vacuously passes in args scorer#1735

Closed
ajay-kesavan wants to merge 1 commit into
mainfrom
ajay/eval-args-empty-expected-fix
Closed

fix(eval): empty expected.args no longer vacuously passes in args scorer#1735
ajay-kesavan wants to merge 1 commit into
mainfrom
ajay/eval-args-empty-expected-fix

Conversation

@ajay-kesavan

Copy link
Copy Markdown
Contributor

Summary

  • `tool_calls_args_score` iterated over `expected.args` and any tool call whose expected.args was `{}` vacuously passed because `all([])` is `True` in Python.
  • Result: an evaluator configured with `{ "name": "X", "args": {} }` scored 1.0 regardless of the actual call's arguments. Both `subset=True` and `subset=False` modes hit this — the loop just doesn't execute when expected.args is empty.
  • This is the most common eval-authoring failure mode: users hit "add tool" in the picker, name the tool, never fill in args, and silently get a green check on runs that aren't validating anything.
  • Add a guard: empty expected.args + non-empty actual.args = mismatch in either mode. Empty + empty still matches.

Why this matters

Caught during end-to-end testing of #1733 (sanitised-name match). With #1733, the sanitiser correctly resolved `Web Search` (display) ↔ `Web_Search` (sanitised runtime) and the args evaluator then "matched" with score 1.0 — but the score was meaningless because expected.args was `{}`. The user's authoring path through the Studio Web evaluator picker produced this shape by default.

Behavior matrix

expected.args actual.args Before After Notes
`{}` `{}` 1.0 1.0 regression check
`{}` `{a: 1}` 1.0 0.0 the bug being fixed
`{a: 1}` `{a: 1}` 1.0 1.0 unchanged
`{a: 1}` `{a: 1, b: 2}` 1.0 1.0 subset/exact unchanged on non-empty expected
`{a: 1, b: 2}` `{a: 1}` 0.0 0.0 KeyError → False, unchanged

The change is targeted at the empty-expected case only. Non-empty expected paths are not touched — `subset=True` keeps subset semantics, `subset=False` keeps its KeyError-on-missing semantics.

Test plan

  • `pytest tests/evaluators/test_evaluator_helpers.py::TestToolCallsArgsScore` — 11 passed (8 existing + 3 new)
  • `pytest tests/cli/eval/ tests/evaluators/` — 873 passed, full eval surface unchanged
  • `ruff check` / `ruff format --check` / `mypy` on touched files — all green

Risk / migration

Any existing eval-set that was unintentionally using `args: {}` as a placeholder will start scoring 0 instead of 1.0 — that's the intended outcome (the green check was hiding the fact that the criterion wasn't validating anything). No code-level migration needed; users authoring fresh eval sets should fill in real expected args.

Independent of, and complementary to, #1733.

🤖 Generated with Claude Code

tool_calls_args_score iterated over expected.args via:
    all(args_check(k, v) for k, v in expected_tool_call.args.items())
When expected.args was {}, the generator iterated zero times and
all([]) returned True, causing the per-call args check to vacuously
pass regardless of what the actual tool call passed.

The most common authoring failure surfaced from this: an evaluator
configured with toolCalls: [{ "name": "X", "args": {} }] would always
score 1.0 even when the agent passed substantial real arguments,
masking what the user intended as "real validation". Both subset=True
and subset=False modes hit this — the loop just doesn't execute.

Add an explicit guard: empty expected.args with non-empty actual.args
is a mismatch in either mode. Empty expected with empty actual still
matches (1.0). Non-empty expected paths are unchanged.

Three new tests cover the guard:
  * empty expected + non-empty actual under subset=False → 0.0
  * empty expected + non-empty actual under subset=True → 0.0
  * empty expected + empty actual → 1.0 (regression check)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 18, 2026 17:53
@github-actions github-actions Bot added test:uipath-langchain Triggers tests in the uipath-langchain-python repository test:uipath-integrations labels Jun 18, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a correctness bug in the tool-call arguments scorer where an empty expected.args would previously “match” any actual args due to all([]) == True, causing unintended 1.0 scores for effectively non-validating criteria. This fits into the eval framework’s tool-call evaluators by making argument validation meaningful even when criteria are authored with default/empty args.

Changes:

  • Add an explicit guard in tool_calls_args_score so expected.args == {} only matches when actual.args == {}, and fails when actual args are non-empty.
  • Add targeted regression tests covering empty-expected vs non-empty-actual in both subset and non-subset modes, plus the empty/empty pass case.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
packages/uipath/src/uipath/eval/_helpers/evaluators_helpers.py Adds a guard to prevent vacuous all([]) passing when expected.args is empty but actual args are present.
packages/uipath/tests/evaluators/test_evaluator_helpers.py Adds regression tests validating the new empty-expected behavior (fail on non-empty actual; pass on empty actual).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@sonarqubecloud

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:uipath-integrations test:uipath-langchain Triggers tests in the uipath-langchain-python repository

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants