Throw exception if LLM output is not parseable #45158
Conversation
Pull request overview
This pull request changes the error handling behavior across multiple evaluator classes in the azure-ai-evaluation SDK. Previously, when evaluators received unparseable LLM output (non-dictionary format), they would return fallback values (NaN for scored evaluators, 0 for binary evaluators) and log a warning. Now, they raise an EvaluationException with standardized error information.
Changes:
- Replaced fallback return values with exception raising when LLM output is not parseable
- Standardized error messages to "Evaluator returned invalid output." across all affected evaluators
- Changed the error target from evaluator-specific targets to the generic ErrorTarget.EVALUATE for consistency with the base class pattern
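For callers that previously relied on the fallback values, a minimal sketch of how the old behavior could be preserved at the call site is shown below. The wrapper name, the shape of the returned dictionary, and the import path are assumptions for illustration, not part of this PR:

```python
import math

# Assumed import path: EvaluationException is defined in the SDK's internal
# _exceptions module, which the evaluators themselves import from.
from azure.ai.evaluation._exceptions import EvaluationException


def evaluate_with_fallback(evaluator, **eval_input):
    """Hypothetical wrapper restoring the pre-PR NaN-on-failure behavior."""
    try:
        return evaluator(**eval_input)
    except EvaluationException:
        # Before this change the evaluator returned NaN (or 0 for binary
        # evaluators) and logged a warning; now the failure is raised instead.
        return {"score": math.nan}
```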
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| _tool_selection/_tool_selection.py | Throws exception instead of returning fallback value when LLM output is not a dictionary |
| _tool_output_utilization/_tool_output_utilization.py | Replaces warning log + NaN return with exception raising for invalid output |
| _tool_input_accuracy/_tool_input_accuracy.py | Throws exception instead of returning fallback value when LLM output is not a dictionary |
| _tool_call_success/_tool_call_success.py | Replaces warning log + NaN return with exception raising for invalid output |
| _tool_call_accuracy/_tool_call_accuracy.py | Throws exception instead of returning fallback value when LLM output is not a dictionary |
| _task_completion/_task_completion.py | Replaces warning log + 0 return with exception raising for invalid output |
| _task_adherence/_task_adherence.py | Replaces warning log + 0 return with exception raising for invalid output |
| _response_completeness/_response_completeness.py | Replaces warning log + NaN return with exception raising for invalid output |
| _relevance/_relevance.py | Replaces warning log + NaN return with exception raising for invalid output |
| _intent_resolution/_intent_resolution.py | Replaces warning log + NaN return with exception raising for invalid output |
| _common/_base_prompty_eval.py | Adds exception raising (already had this pattern, consolidated with other changes) |
```python
raise EvaluationException(
    message="Evaluator returned invalid output.",
    blame=ErrorBlame.SYSTEM_ERROR,
    category=ErrorCategory.FAILED_EXECUTION,
    target=ErrorTarget.EVALUATE,
)
```
Copilot AI
Feb 12, 2026
This is a breaking behavioral change that should be documented in the CHANGELOG. Previously, when the evaluator received invalid (non-parseable) output, it would return a default value (NaN or 0). Now it raises an EvaluationException. This change affects multiple evaluators: ToolSelectionEvaluator, ToolOutputUtilizationEvaluator, ToolInputAccuracyEvaluator, ToolCallSuccessEvaluator, ToolCallAccuracyEvaluator, TaskCompletionEvaluator, TaskAdherenceEvaluator, ResponseCompletenessEvaluator, RelevanceEvaluator, and IntentResolutionEvaluator. The CHANGELOG should document this under a "Breaking Changes" section.
```diff
+raise EvaluationException(
+    message="Evaluator returned invalid output.",
+    blame=ErrorBlame.SYSTEM_ERROR,
+    category=ErrorCategory.FAILED_EXECUTION,
+    target=ErrorTarget.EVALUATE,
+)
-if logger:
-    logger.warning(
-        "LLM output is not a dictionary; returning NaN for the score and empty reason."
-    )
-return {
-    f"{self._result_key}": math.nan,
-    f"{self._result_key}_reason": "",
-    f"{self._result_key}_result": "fail",
-    f"{self._result_key}_threshold": self._threshold,
-    f"{self._result_key}_prompt_tokens": prompty_output_dict.get("input_token_count", 0),
-    f"{self._result_key}_completion_tokens": prompty_output_dict.get("output_token_count", 0),
-    f"{self._result_key}_total_tokens": prompty_output_dict.get("total_token_count", 0),
-    f"{self._result_key}_finish_reason": prompty_output_dict.get("finish_reason", ""),
-    f"{self._result_key}_model": prompty_output_dict.get("model_id", ""),
-    f"{self._result_key}_sample_input": prompty_output_dict.get("sample_input", ""),
-    f"{self._result_key}_sample_output": prompty_output_dict.get("sample_output", ""),
-}
```
```python
raise EvaluationException(
    message="Evaluator returned invalid output.",
    blame=ErrorBlame.SYSTEM_ERROR,
    category=ErrorCategory.FAILED_EXECUTION,
    target=ErrorTarget.EVALUATE,
)
```
Copilot AI
Feb 12, 2026
This is a breaking change. The existing test at lines 230-244 in test_task_completion_evaluator.py expects this evaluator to return 0 when the LLM output is not a dictionary, but the new code raises an exception instead. The test needs to be updated to expect an EvaluationException.
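A hedged sketch of how that test could be updated is shown below, assuming a pytest-based suite; the fixture name and the call arguments are hypothetical stand-ins for whatever the existing test at lines 230-244 already sets up:

```python
import pytest

# Assumed import path for the SDK's internal exception type.
from azure.ai.evaluation._exceptions import EvaluationException


def test_invalid_llm_output_raises(task_completion_evaluator_with_mocked_flow):
    # Hypothetical fixture: a TaskCompletionEvaluator whose LLM flow is mocked
    # to return a non-dictionary payload, mirroring the existing test setup.
    evaluator = task_completion_evaluator_with_mocked_flow

    # Old expectation: a score of 0 plus a logged warning.
    # New expectation after this PR: an EvaluationException is raised.
    with pytest.raises(EvaluationException):
        evaluator(query="example query", response="example response")
```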