Commit 4a599ec

evals: rework the StatementEvaluator rubric
It turned out that the evals had been passing only because `osidb_cache` was feeding the LLM unfiltered data from OSIDB, which also contained the fields to be suggested.

Fixes: commit 2b9fa21
Related: https://issues.redhat.com/browse/AEGIS-265
1 parent 670bce3 commit 4a599ec
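
The leak described in the commit message can be illustrated with a minimal sketch. It is not taken from the repository: the record layout, the field names `statement` and `mitigation`, and the helper `strip_suggested_fields` are all hypothetical, chosen only to show why cached data that still carries the target fields lets an eval pass trivially.

```python
from typing import Any

# Hypothetical names for the fields the eval asks the model to suggest.
SUGGESTED_FIELDS = frozenset({"statement", "mitigation"})


def strip_suggested_fields(
    record: dict[str, Any],
    hidden: frozenset[str] = SUGGESTED_FIELDS,
) -> dict[str, Any]:
    """Return a copy of a cached record without the fields under evaluation."""
    return {key: value for key, value in record.items() if key not in hidden}


# Hypothetical cached record that still contains the expected answers.
cached_flaw = {
    "cve_id": "CVE-0000-0000",
    "title": "example flaw",
    "statement": "expected answer that must not reach the model",
    "mitigation": "expected mitigation that must not reach the model",
}

# Only the filtered copy should be used as model input.
model_input = strip_suggested_fields(cached_flaw)
```

If the unfiltered `cached_flaw` were passed to the model instead, a semantic-equivalence judge would score a copied answer highly regardless of how the rubric is worded.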

1 file changed: +7 -1 lines changed
evals/features/cve/test_suggest_statement.py

Lines changed: 7 additions & 1 deletion
@@ -20,7 +20,13 @@
 field_evaluators = {
     "suggested_statement": create_llm_judge(
         score_name="StatementEvaluator",
-        rubric="Score how much the actual suggested_statement field is semantically equivalent to the expected suggested_statement field. If the key message is the same but the style is different, the score should not be zero. If the style is different, the score should not be 1.0.",
+        rubric=(
+            "Score semantic equivalence between the actual suggested_statement and the expected suggested_statement. "
+            "Emphasize matching rationale (impact justification in RH context, preconditions, scope). "
+            "If style differs but the core message overlaps, the score should be > 0.0 and < 1.0 depending on overlap. "
+            "Only assign 0.0 if the actual is irrelevant to the CVE or contradicts the expected meaning. "
+            "When partially aligned but missing details, prefer a low non-zero score (e.g., 0.12–0.3) rather than 0.0."
+        ),
         include_expected_output=True,
     ),
     "suggested_mitigation": create_llm_judge(
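
The new rubric is written as adjacent string literals inside parentheses. As a small standalone sketch (the variable names `rubric` and `not_a_rubric` are illustrative, and `create_llm_judge` is not called here), the snippet below shows that Python joins such literals into one `str`, which is why each fragment except the last ends with a trailing space, and that an accidental comma would turn the value into a tuple instead.

```python
# Adjacent string literals are concatenated by the Python compiler,
# so the parenthesized expression below is a single str.
rubric = (
    "Score semantic equivalence between the actual suggested_statement and the expected suggested_statement. "
    "Emphasize matching rationale (impact justification in RH context, preconditions, scope). "
    "If style differs but the core message overlaps, the score should be > 0.0 and < 1.0 depending on overlap. "
    "Only assign 0.0 if the actual is irrelevant to the CVE or contradicts the expected meaning. "
    "When partially aligned but missing details, prefer a low non-zero score (e.g., 0.12–0.3) rather than 0.0."
)

# A comma between the literals changes the type: this is a tuple of two strings.
not_a_rubric = (
    "Score semantic equivalence. ",
    "Emphasize matching rationale.",
)

is_single_string = isinstance(rubric, str)
is_tuple = isinstance(not_a_rubric, tuple)
```

The trailing spaces keep sentences from running together in the joined string, for example `"...suggested_statement. Emphasize..."` rather than `"...suggested_statement.Emphasize..."`.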
