Throw exception if LLM output is not parseable #45158
Conversation
Pull request overview
This pull request changes the error handling behavior across multiple evaluator classes in the azure-ai-evaluation SDK. Previously, when evaluators received unparseable LLM output (non-dictionary format), they would return fallback values (NaN for scored evaluators, 0 for binary evaluators) and log a warning. Now, they raise an EvaluationException with standardized error information.
Changes:
- Replaced fallback return values with exception raising when LLM output is not parseable
- Standardized error messages to "Evaluator returned invalid output." across all affected evaluators
- Changed the error target from evaluator-specific targets to the generic ErrorTarget.EVALUATE for consistency with the base class pattern
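For callers that previously relied on the fallback values, a minimal sketch of how the old behavior could be preserved at the call site is shown below. The wrapper name, the shape of the returned dictionary, and the import path are assumptions for illustration, not part of this PR:

```python
import math

# Assumed import path: EvaluationException is defined in the SDK's internal
# _exceptions module, which the evaluators themselves import from.
from azure.ai.evaluation._exceptions import EvaluationException


def evaluate_with_fallback(evaluator, **eval_input):
    """Hypothetical wrapper restoring the pre-PR NaN-on-failure behavior."""
    try:
        return evaluator(**eval_input)
    except EvaluationException:
        # Before this change the evaluator returned NaN (or 0 for binary
        # evaluators) and logged a warning; now the failure is raised instead.
        return {"score": math.nan}
```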
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| _tool_selection/_tool_selection.py | Throws exception instead of returning fallback value when LLM output is not a dictionary |
| _tool_output_utilization/_tool_output_utilization.py | Replaces warning log + NaN return with exception raising for invalid output |
| _tool_input_accuracy/_tool_input_accuracy.py | Throws exception instead of returning fallback value when LLM output is not a dictionary |
| _tool_call_success/_tool_call_success.py | Replaces warning log + NaN return with exception raising for invalid output |
| _tool_call_accuracy/_tool_call_accuracy.py | Throws exception instead of returning fallback value when LLM output is not a dictionary |
| _task_completion/_task_completion.py | Replaces warning log + 0 return with exception raising for invalid output |
| _task_adherence/_task_adherence.py | Replaces warning log + 0 return with exception raising for invalid output |
| _response_completeness/_response_completeness.py | Replaces warning log + NaN return with exception raising for invalid output |
| _relevance/_relevance.py | Replaces warning log + NaN return with exception raising for invalid output |
| _intent_resolution/_intent_resolution.py | Replaces warning log + NaN return with exception raising for invalid output |
| _common/_base_prompty_eval.py | Adds exception raising (already had this pattern, consolidated with other changes) |
```python
raise EvaluationException(
    message="Evaluator returned invalid output.",
    blame=ErrorBlame.SYSTEM_ERROR,
    category=ErrorCategory.FAILED_EXECUTION,
    target=ErrorTarget.EVALUATE,
)
```
Copilot AI
Feb 12, 2026
This is a breaking behavioral change that should be documented in the CHANGELOG. Previously, when the evaluator received invalid (non-parseable) output, it would return a default value (NaN or 0). Now it raises an EvaluationException. This change affects multiple evaluators: ToolSelectionEvaluator, ToolOutputUtilizationEvaluator, ToolInputAccuracyEvaluator, ToolCallSuccessEvaluator, ToolCallAccuracyEvaluator, TaskCompletionEvaluator, TaskAdherenceEvaluator, ResponseCompletenessEvaluator, RelevanceEvaluator, and IntentResolutionEvaluator. The CHANGELOG should document this under a "Breaking Changes" section.
```diff
+raise EvaluationException(
+    message="Evaluator returned invalid output.",
+    blame=ErrorBlame.SYSTEM_ERROR,
+    category=ErrorCategory.FAILED_EXECUTION,
+    target=ErrorTarget.EVALUATE,
+)
-if logger:
-    logger.warning(
-        "LLM output is not a dictionary; returning NaN for the score and empty reason."
-    )
-return {
-    f"{self._result_key}": math.nan,
-    f"{self._result_key}_reason": "",
-    f"{self._result_key}_result": "fail",
-    f"{self._result_key}_threshold": self._threshold,
-    f"{self._result_key}_prompt_tokens": prompty_output_dict.get("input_token_count", 0),
-    f"{self._result_key}_completion_tokens": prompty_output_dict.get("output_token_count", 0),
-    f"{self._result_key}_total_tokens": prompty_output_dict.get("total_token_count", 0),
-    f"{self._result_key}_finish_reason": prompty_output_dict.get("finish_reason", ""),
-    f"{self._result_key}_model": prompty_output_dict.get("model_id", ""),
-    f"{self._result_key}_sample_input": prompty_output_dict.get("sample_input", ""),
-    f"{self._result_key}_sample_output": prompty_output_dict.get("sample_output", ""),
-}
```
```python
raise EvaluationException(
    message="Evaluator returned invalid output.",
    blame=ErrorBlame.SYSTEM_ERROR,
    category=ErrorCategory.FAILED_EXECUTION,
    target=ErrorTarget.EVALUATE,
)
```
Copilot AI
Feb 12, 2026
This is a breaking change. The existing test at lines 230-244 in test_task_completion_evaluator.py expects this evaluator to return 0 when the LLM output is not a dictionary, but the new code raises an exception instead. The test needs to be updated to expect an EvaluationException.
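A hedged sketch of how that test could be updated is shown below, assuming a pytest-based suite; the fixture name and the call arguments are hypothetical stand-ins for whatever the existing test at lines 230-244 already sets up:

```python
import pytest

# Assumed import path for the SDK's internal exception type.
from azure.ai.evaluation._exceptions import EvaluationException


def test_invalid_llm_output_raises(task_completion_evaluator_with_mocked_flow):
    # Hypothetical fixture: a TaskCompletionEvaluator whose LLM flow is mocked
    # to return a non-dictionary payload, mirroring the existing test setup.
    evaluator = task_completion_evaluator_with_mocked_flow

    # Old expectation: a score of 0 plus a logged warning.
    # New expectation after this PR: an EvaluationException is raised.
    with pytest.raises(EvaluationException):
        evaluator(query="example query", response="example response")
```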