
Conversation

@kdudka
Collaborator

@kdudka kdudka commented Dec 3, 2025

What

  • chore: update pinned dependencies in uv.lock
  • chore: temporarily downgrade logfire pinned dependency to eliminate OTEL-related warnings
  • evals: print more user-friendly failure messages
  • evals: provide reason for SuggestCweEvaluator score
  • evals: provide reason for CVSSVectorEvaluator score
  • fix: rework the suggest-impact LLM prompt template
  • fix: rework the suggest-statement LLM prompt template
  • evals: rework the StatementEvaluator rubric
  • fix: decouple OSIDB fields exclusion from retrieval
  • evals: annotate specific evaluators for specific evaluation cases that are known to fail
  • ci: do not set AEGIS_EVALS_MIN_PASSED any more

It turned out that certain evaluation cases had been passing only by mistake. After making the field exclusion work as expected in the evaluation suite, we had to apply a couple of fixes (and annotate some evaluation cases as known to fail) to make the CI green again.

Why

The one-function-does-all approach did not work well with `osidb_cache`.

When OSIDB data was cached for the first time (after field exclusion was implemented), the data was stored with the fields already excluded. Then, when the same data was used to evaluate another feature, the wrong fields were provided as input for the evaluation. For legacy cached data, no fields were excluded during evaluation at all.
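The fix therefore keeps retrieval and caching working on the full CVE record and applies the exclusion only at use time, per feature. A minimal sketch of that flow, with an in-memory cache and an injected fetch callable standing in for the real `osidb_cache` and OSIDB client (both are illustrative, not the actual Aegis API):

```python
from typing import Callable, Dict, List, Optional

_cache: Dict[str, dict] = {}  # stand-in for osidb_cache; always holds *full* records


def cve_for_feature(
    cve_id: str,
    exclude_fields: Optional[List[str]],
    fetch: Callable[[str], dict],
) -> dict:
    """Fetch (or reuse) the full CVE record, then drop feature-specific fields."""
    record = _cache.get(cve_id)
    if record is None:
        record = fetch(cve_id)   # always retrieve the complete record from OSIDB
        _cache[cve_id] = record  # cache it unfiltered
    # exclusion happens only here, so another feature can reuse the same cached
    # record with a different exclusion list
    excluded = set(exclude_fields or [])
    return {k: v for k, v in record.items() if k not in excluded}
```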

How to Test

See the CI results.

Related Tickets

Fixes: commit 4718c58
Fixes: commit 2b9fa21
Related: https://issues.redhat.com/browse/AEGIS-265

Summary by Sourcery

Decouple OSIDB CVE field exclusion from data retrieval and update CVE-related LLM prompts and evaluation harnesses to produce clearer outputs and more reliable, annotated eval results.

Bug Fixes:

  • Ensure OSIDB CVE records are always retrieved in full and have fields excluded afterward so cached data is consistent across features.
  • Improve CVSS vector comparison evaluation to return both a confidence-adjusted score and human-readable reasons for mismatches.

Enhancements:

  • Refine the suggest-impact and suggest-statement LLM prompt templates for clearer CVSS guidance, Red Hat–specific context, and more actionable mitigation patterns.
  • Extend OSIDB CVE model defaults and add a reusable helper to strip excluded fields, including Red Hat CVSS scores, from retrieved data.
  • Improve evaluation reporting by including evaluator names and reasons, and by skipping threshold failures for evaluators marked as known-to-fail per case.
  • Allow CVE evaluation cases to carry per-evaluator metadata and expand CWE and statement test suites with known-to-fail annotations.
  • Clarify evaluation rubrics for suggested statements and descriptions to better capture semantic equivalence and avoid penalizing benign version-like content.
  • Configure logging to import logfire lazily only when OpenTelemetry is enabled.

Build:

  • Bump the pinned logfire dependency to version 4.15.1 in project configuration and lockfile.

CI:

  • Remove the AEGIS_EVALS_MIN_PASSED override from the Tekton pipeline so CI relies on per-evaluator scoring and known-to-fail annotations.

Tests:

  • Update CVE-related eval test cases (CWE, impact, statement, description) to support metadata, improved rubrics, and newly annotated known-to-fail evaluators, and to exercise the new evaluation reasoning output.

@kdudka kdudka self-assigned this Dec 3, 2025
@kdudka
Collaborator Author

kdudka commented Dec 4, 2025

@sourcery-ai review

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • In score_cvss3_diff, the reason string is built even when there are no metric mismatches; consider returning None or skipping the prefix when mismatch_list is empty so callers don't see a misleading "mismatched metrics:" message for near-identical vectors (a sketch of this guard follows below).
  • The eval_name_from_result helper assumes result.source.arguments['score']['evaluation_name'] always exists; it might be safer to guard against KeyError/TypeError and fall back to a generic label to avoid unexpected crashes if the LLM judge payload shape changes.
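
For the first point, a minimal sketch of the guard, assuming `score_cvss3_diff` collects the differing metric names in a local `mismatch_list` before building its `(score, reason)` return value (the helper name is illustrative):

```python
def build_reason(mismatch_list: list[str]) -> str | None:
    """Build a human-readable reason only when there is something to report."""
    if not mismatch_list:
        # near-identical vectors: avoid a dangling "mismatched metrics:" prefix
        return None
    return "mismatched metrics: " + ", ".join(mismatch_list)
```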

## Individual Comments

### Comment 1
<location> `src/aegis_ai/features/cve/__init__.py:74-85` </location>
<code_context>
+                    - Use github mcp and web search tools to resolve reference URLs.
+                    - Always use kernel_cve tool if the component is the Linux kernel.
+                    - If cisa_kev_tool is available, check for known exploits.
+                - Confidence:
+                    - Calibrate confidence to the fraction of base metrics you are ≥80% sure about (e.g., 0.75 if 6/8 are certain).
                 - Output
-                    - output a plausible CVSS 3.1 base vector and score.
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Confidence is described but no longer mentioned in the output contract.

The rules still define how to calibrate a Confidence value, but the Output section no longer specifies a confidence field (it used to). This mismatch can confuse the model and any schema-dependent consumers. Please either add confidence back into the Output schema (with a clear field name/location) or remove the Confidence section so the instructions are self‑consistent.

```suggestion
                - Retrieve and summarize additional context from vulnerability references:
                    - Use github mcp and web search tools to resolve reference URLs.
                    - Always use kernel_cve tool if the component is the Linux kernel.
                    - If cisa_kev_tool is available, check for known exploits.
                - Confidence:
                    - Calibrate confidence to the fraction of base metrics you are ≥80% sure about (e.g., 0.75 if 6/8 are certain).
                - Output
                    - Provide the CVSS 3.1 base vector string, the numeric base score, and a numeric confidence value between 0 and 1 named `confidence`, calibrated as described above.
                    - Provide the vector and score first, then impact, then a concise explanation with metric-by-metric rationale.
                    - Keep explanations concise.
            """,
            context=CVEFeatureInput(cve_id=cve_id),
            static_context=remove_keys(
        )
```
</issue_to_address>

### Comment 2
<location> `evals/features/common.py:153-160` </location>
<code_context>
     return EvaluationReason(value=value, reason=(fail_reason if not value else None))


+def eval_name_from_result(result):
+    """return human-readable evaluator name associated with the evaluation result"""
+    try:
+        # This works for our custom evaluators
+        return result.name
+    except AttributeError:
+        # This works for a scoring LLMJudge
+        return result.source.arguments["score"]["evaluation_name"]
+
+
</code_context>

<issue_to_address>
**issue (bug_risk):** Assumed structure of `result.source.arguments` may raise runtime KeyErrors.

In the fallback branch, this assumes `result.source.arguments["score"]["evaluation_name"]` is always present. If an evaluator changes its argument shape or omits these keys, this will raise a `KeyError` instead of degrading gracefully. Consider using safe access (e.g. `get`/membership checks) and falling back to a generic name (or `str(result.source)`) when those keys are missing.
</issue_to_address>
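
A possible defensive variant, sketched here with `getattr`/`.get` lookups and a generic fallback label; the exact shape of the LLM judge payload is an assumption:

```python
def eval_name_from_result(result) -> str:
    """Return a human-readable evaluator name, degrading gracefully."""
    # custom evaluators expose a name directly
    name = getattr(result, "name", None)
    if name:
        return name
    # a scoring LLMJudge keeps the name in its call arguments; guard every level
    source = getattr(result, "source", None)
    arguments = getattr(source, "arguments", None)
    if isinstance(arguments, dict):
        score = arguments.get("score")
        if isinstance(score, dict) and "evaluation_name" in score:
            return score["evaluation_name"]
    # fall back to a generic label instead of raising KeyError/TypeError
    return str(source) if source is not None else "unknown evaluator"
```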

### Comment 3
<location> `evals/features/common.py:193-201` </location>
<code_context>
+    for ecase in report.failures:
+        failures += f"{ecase.name}: case failure: {ecase.error_message}\n"

     # iterate through evaluated cases
-    for case in report.cases:
+    for ecase in report.cases:
         # bool assertions
-        for result in case.assertions.values():
+        for result in ecase.assertions.values():
             if result.value is False:
-                failures += f"{case.name}: {result.source}: {result.reason}\n"
+                failures += f"{ecase.name}: {result.name}: False"
+                if result.reason:
+                    failures += f", reason: {result.reason}"
+                failures += "\n"

         # evaluator failures
</code_context>

<issue_to_address>
**suggestion:** Failure messages for boolean assertions changed and no longer surface evaluator identity consistently.

The previous format logged `{case.name}: {result.source}: {result.reason}`, which let us easily tie failures to specific evaluators/checks. The new format uses `{ecase.name}: {result.name}: False[, reason: ...]` and drops `result.source`, so identical assertion names across evaluators become harder to disambiguate. Please include a stable evaluator identifier (e.g., `result.source` or a derived `eval_name_from_result`) alongside `result.name` to keep these failure logs unambiguous.

```suggestion
    # iterate through evaluated cases
    for ecase in report.cases:
        # bool assertions
        for result in ecase.assertions.values():
            if result.value is False:
                # include evaluator identity (result.source) to keep logs unambiguous
                failures += f"{ecase.name}: {result.source}: {result.name}: False"
                if result.reason:
                    failures += f", reason: {result.reason}"
                failures += "\n"
```
</issue_to_address>


... so that we use supported versions of the modules we depend on.

Related: https://issues.redhat.com/browse/AEGIS-265
... to eliminate the following OTEL-related warnings:
```
<frozen abc>:106
<frozen abc>:106
  <frozen abc>:106: DeprecationWarning: You should use `Logger` instead. Deprecated since version 1.39.0 and will be removed in a future release.

<frozen abc>:106
<frozen abc>:106
  <frozen abc>:106: DeprecationWarning: You should use `LoggerProvider` instead. Deprecated since version 1.39.0 and will be removed in a future release.

.venv/lib64/python3.13/site-packages/opentelemetry/_events/__init__.py:201
  /home/kdudka/git/aegis/.venv/lib64/python3.13/site-packages/opentelemetry/_events/__init__.py:201: DeprecationWarning: You should use `ProxyLoggerProvider` instead. Deprecated since version 1.39.0 and will be removed in a future release.
    _PROXY_EVENT_LOGGER_PROVIDER = ProxyEventLoggerProvider()
```

Related: https://issues.redhat.com/browse/AEGIS-265
`case` is a soft keyword in Python 3.10+, used in the match/case statement.

Related: https://issues.redhat.com/browse/AEGIS-265
... when an evaluator assigns a score below the threshold.

Related: https://issues.redhat.com/browse/AEGIS-265
@kdudka kdudka force-pushed the exclude-fields branch 2 times, most recently from 64dcb5f to 9f96090 on December 5, 2025 at 11:12
@kdudka
Collaborator Author

kdudka commented Dec 5, 2025

@sourcery-ai review

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • In evals/features/common.py, eval_name_from_result assumes the LLM judge result structure always has source.arguments['score']['evaluation_name']; consider using .get lookups (and a sensible fallback) to avoid KeyErrors if the underlying result schema changes.
  • In cve_exclude_fields, it may be worth normalizing exclude_fields to an empty list when None and preventing exclusion of required fields like cve_id, so that unexpected input or future feature_deps values don't cause CVE instantiation errors.

## Individual Comments

### Comment 1
<location> `src/aegis_ai/toolsets/tools/osidb/__init__.py:93-102` </location>
<code_context>
+    )
+
+
+def cve_exclude_fields(cve: CVE, exclude_fields: List[str]):
+    """return a CVE object with data removed in fields specified by exclude_fields"""
+    # "cve_description" is used in OSIM, "description" is used in OSIDB
+    fields_to_exclude = set(
+        [field.replace("cve_description", "description") for field in exclude_fields]
+    )
+
+    # create a local copy so that we can change the CVE object
+    cve = cve.model_copy()
+    if "rh_cvss_score" in fields_to_exclude:
+        # exclude RH-provided CVSS
+        cve.cvss_scores = [cvss for cvss in cve.cvss_scores if cvss["issuer"] != "RH"]
+
+    # finally remove all fields listed in fields_to_exclude
+    filtered_dump = cve.model_dump(exclude=fields_to_exclude)
+    return CVE(**filtered_dump)

</code_context>

<issue_to_address>
**issue (bug_risk):** Handle a possible None value for exclude_fields in cve_exclude_fields to avoid runtime errors.

Since `flaw_tool` and `osidb_tool` now pass `ctx.deps.exclude_osidb_fields` directly to `cve_exclude_fields`, a `None` (or other non-list falsy value) will cause a `TypeError` when iterating or building the set. Please normalize this parameter inside `cve_exclude_fields` (e.g., `exclude_fields = exclude_fields or []`) or type it as `Optional[List[str]]` and handle `None` explicitly.
</issue_to_address>
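
A sketch of the suggested normalization, tolerating `None` and keeping required fields such as `cve_id` out of the exclusion set (the `REQUIRED_CVE_FIELDS` constant is hypothetical); the result could then feed into `cve.model_dump(exclude=...)` as in the current implementation:

```python
from typing import List, Optional

REQUIRED_CVE_FIELDS = {"cve_id"}  # fields the CVE model cannot be built without


def normalize_exclude_fields(exclude_fields: Optional[List[str]]) -> set:
    """Map OSIM field names to OSIDB ones, drop required fields, tolerate None."""
    fields = exclude_fields or []
    # "cve_description" is used in OSIM, "description" is used in OSIDB
    mapped = {field.replace("cve_description", "description") for field in fields}
    return mapped - REQUIRED_CVE_FIELDS
```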

### Comment 2
<location> `evals/features/cve/test_suggest_impact.py:36-45` </location>
<code_context>
 # TODO: check whether the cvss Python module could anyhow help with this
-def score_cvss3_diff(cvss3: str, cvss3_exp: str) -> float:
-    """check how a pair of CVSS 3.1 vectors differ from each other.  0.0 means completely different, 1.0 means exact match"""
+def score_cvss3_diff(cvss3: str, cvss3_exp: str) -> tuple[float, str | None]:
+    """Compare two CVSS 3.1 vectors and return (score, reason).
+    0.0 means completely different, 1.0 means exact match.
</code_context>

<issue_to_address>
**suggestion (testing):** Consider adding direct unit tests for `score_cvss3_diff` to cover parsing and the new reason output

This function now handles parsing, metric normalization, distance computation, and human-readable reason construction, but none of this is covered by direct tests.

Please add focused unit tests in this module to cover cases such as:
- Exact match: identical vectors return `(1.0, None)`.
- Single-metric difference: e.g. only `AV` differs, score < 1 and `reason` mentions `AV`.
- Multiple metric differences: `reason` lists all mismatched metrics and the score reflects greater distance.
- Malformed/partial vectors: missing/invalid metrics either produce predictable behavior or trigger the `CVSSVectorEvaluator` exception path with a reason like `"unhandled exception: ..."`.

This will keep `CVSSVectorEvaluator` behavior reliable as vectors and formats change.

Suggested implementation:

```python
# TODO: check whether the cvss Python module could anyhow help with this
def score_cvss3_diff(cvss3: str, cvss3_exp: str) -> tuple[float, str | None]:
    """Compare two CVSS 3.1 vectors and return (score, reason).
    0.0 means completely different, 1.0 means exact match.
    When the score is not 1.0, reason enumerates mismatched metrics."""
    if cvss3 == cvss3_exp:
        # exact match
        return (1.0, None)

    def _parse(v: str) -> dict[str, str]:
        parts = v.split("/")
    metrics = tuple(scales.keys())
    diffs = [_norm_dist(m) for m in metrics]


def test_score_cvss3_diff_exact_match() -> None:
    base_vector = "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"

    score, reason = score_cvss3_diff(base_vector, base_vector)

    assert score == 1.0
    assert reason is None


def test_score_cvss3_diff_single_metric_difference() -> None:
    base_vector = "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"
    # Only AV differs
    changed_vector = "CVSS:3.1/AV:L/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"

    score, reason = score_cvss3_diff(base_vector, changed_vector)

    assert 0.0 < score < 1.0
    assert reason is not None
    # reason should mention the differing metric
    assert "AV" in reason


def test_score_cvss3_diff_multiple_metric_differences() -> None:
    base_vector = "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"
    # Multiple metrics differ (AV and AC here)
    changed_vector = "CVSS:3.1/AV:L/AC:H/PR:N/UI:N/S:U/C:H/I:H/A:H"

    score_single, reason_single = score_cvss3_diff(
        base_vector,
        "CVSS:3.1/AV:L/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H",
    )
    score_multi, reason_multi = score_cvss3_diff(base_vector, changed_vector)

    # More differences should not yield a higher similarity score
    assert score_multi <= score_single
    assert reason_multi is not None
    assert "AV" in reason_multi
    assert "AC" in reason_multi


def test_score_cvss3_diff_malformed_vector_unhandled_exception() -> None:
    # Clearly malformed / partial vectors to exercise the exception-handling path
    malformed = "not-a-cvss-vector"
    base_vector = "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"

    score, reason = score_cvss3_diff(base_vector, malformed)

    assert reason is not None
    # The implementation is expected to wrap unexpected failures with this prefix
    assert reason.startswith("unhandled exception:")
    # Score should not falsely indicate a perfect match
    assert score <= 1.0

```

Because I only see a fragment of `score_cvss3_diff`, you should:

1. Ensure these tests are at module scope (no indentation) alongside your other `test_...` functions in `evals/features/cve/test_suggest_impact.py`.
2. Adjust the sample CVSS vectors if your implementation expects a slightly different vector prefix or format (e.g., `CVSS:3.0` vs `CVSS:3.1`), but keep the single-metric and multi-metric differences intact.
3. If the actual error-reporting format for malformed vectors differs (for example, a different prefix than `"unhandled exception:"`), update the `assert reason.startswith("unhandled exception:")` line to match the concrete behavior of `CVSSVectorEvaluator` in your codebase.
4. If your test framework or style guidelines differ (e.g., using `unittest.TestCase` instead of plain pytest functions), wrap these assertions accordingly in your preferred testing style.
</issue_to_address>


@kdudka kdudka marked this pull request as ready for review December 5, 2025 11:42
... so that its fields can be easily excluded without
unnecessary code duplication.

Related: https://issues.redhat.com/browse/AEGIS-265
The one-function-does-all approach did not work well with `osidb_cache`.

When OSIDB data was cached for the first time (after field exclusion
was implemented), the data was stored with the fields already excluded.
Then, when the same data was used to evaluate another feature, the
wrong fields were provided as input for the evaluation.  For legacy
cached data, no fields were excluded during evaluation at all.

It turned out that certain evaluation cases had been passing only by
mistake.  After fixing this bug, they had to be commented out to keep
the CI green.

Fixes: commit 4718c58
Related: https://issues.redhat.com/browse/AEGIS-265
It turned out that evals had been passing only because `osidb_cache`
was feeding the LLM with unfiltered data from OSIDB, which also
contained the fields to be suggested.

Related: https://issues.redhat.com/browse/AEGIS-265
It turned out that evals had been passing only because `osidb_cache`
was feeding the LLM with unfiltered data from OSIDB, which also
contained the fields to be suggested.

Fixes: commit 2b9fa21
Related: https://issues.redhat.com/browse/AEGIS-265
It turned out that evals had been passing only because `osidb_cache`
was feeding the LLM with unfiltered data from OSIDB, which also
contained the fields to be suggested.

Fixes: commit 2b9fa21
Related: https://issues.redhat.com/browse/AEGIS-265
... that are known to fail.  Evaluate them, count their score
in the overall average but do not fail if their score is below
the specified threshold.

This way we can monitor improvements (or regressions) even on the cases
that Aegis does not handle perfectly for now.

Related: https://issues.redhat.com/browse/AEGIS-265
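
A rough sketch of how such an annotation could be honored when checking per-evaluator thresholds; the metadata key and function name are hypothetical, not the actual Aegis code:

```python
def should_fail(
    case_metadata: dict, eval_name: str, score: float, threshold: float
) -> bool:
    """Decide whether a below-threshold score fails the case.

    Scores from known-to-fail evaluators are still reported and counted in
    the overall average elsewhere, but they never turn the CI red here.
    """
    known_to_fail = set(case_metadata.get("known_to_fail_evaluators", []))
    if eval_name in known_to_fail:
        return False
    return score < threshold
```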
... so that it is easier to annotate individual evaluation cases.
No changes in behavior intended with this commit.

Related: https://issues.redhat.com/browse/AEGIS-265
Evaluate them, count their score in the overall average but do not fail
if their score is below the specified threshold.

This makes unexpected failures easier to spot in the CI logs.

Related: https://issues.redhat.com/browse/AEGIS-265
... so that the evaluation case definitions are easier to maintain.
No changes in behavior intended with this commit.

Related: https://issues.redhat.com/browse/AEGIS-265
Evaluate them, count their score in the overall average but do not fail
if their score is below the specified threshold.

This makes unexpected failures easier to spot in the CI logs.

Related: https://issues.redhat.com/browse/AEGIS-265
... of the `suggest-description-for-CVE-2025-65073` evaluation case:
```
FAILED evals/features/cve/test_suggest_description.py::test_eval_suggest_description - AssertionError: Unsatisfied assertion(s):
suggest-description-for-CVE-2025-65073: NoVersionInfo: False, reason: The 'suggested_description' contains '/v3/' in the API endpoints '/v3/ec2tokens' and '/v3/s3tokens', which represents a version and is not part of an acronym or its explanation.
```

This is obviously a problem with `LLMJudge` rather than a problem
with the suggestion provided by Aegis.

Related: https://issues.redhat.com/browse/AEGIS-265
Now that we annotate known-to-fail evaluators for each evaluation case
individually, we no longer need to ignore overall failures at the Aegis
feature level.

Related: https://issues.redhat.com/browse/AEGIS-265