fix: decouple OSIDB fields exclusion from retrieval #370
base: main
Conversation
Force-pushed from 54bb621 to 4a599ec
@sourcery-ai review
Force-pushed from 4a599ec to 7668364
Hey there - I've reviewed your changes - here's some feedback:
- In `score_cvss3_diff`, the reason string is built even when there are no metric mismatches; consider returning `None` or skipping the prefix when `mismatch_list` is empty so callers don't see a misleading "mismatched metrics:" message for near-identical vectors.
- The `eval_name_from_result` helper assumes `result.source.arguments['score']['evaluation_name']` always exists; it might be safer to guard against `KeyError`/`TypeError` and fall back to a generic label to avoid unexpected crashes if the LLM judge payload shape changes.
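For the first point above, a minimal sketch of how the reason could be guarded; the `mismatch_list` name comes from the comment, while the helper name and everything else is assumed:

```python
def build_reason(mismatch_list: list[str]) -> str | None:
    """Build a human-readable reason only when metrics actually mismatch."""
    if not mismatch_list:
        # near-identical vectors: avoid a misleading "mismatched metrics:" prefix
        return None
    return "mismatched metrics: " + ", ".join(mismatch_list)
```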
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `score_cvss3_diff`, the reason string is built even when there are no metric mismatches; consider returning `None` or skipping the prefix when `mismatch_list` is empty so callers don't see a misleading "mismatched metrics:" message for near-identical vectors.
- The `eval_name_from_result` helper assumes `result.source.arguments['score']['evaluation_name']` always exists; it might be safer to guard against `KeyError`/`TypeError` and fall back to a generic label to avoid unexpected crashes if the LLM judge payload shape changes.
## Individual Comments
### Comment 1
<location> `src/aegis_ai/features/cve/__init__.py:74-85` </location>
<code_context>
+ - Use github mcp and web search tools to resolve reference URLs.
+ - Always use kernel_cve tool if the component is the Linux kernel.
+ - If cisa_kev_tool is available, check for known exploits.
+ - Confidence:
+ - Calibrate confidence to the fraction of base metrics you are ≥80% sure about (e.g., 0.75 if 6/8 are certain).
- Output
- - output a plausible CVSS 3.1 base vector and score.
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Confidence is described but no longer mentioned in the output contract.
The rules still define how to calibrate a Confidence value, but the Output section no longer specifies a confidence field (it used to). This mismatch can confuse the model and any schema-dependent consumers. Please either add confidence back into the Output schema (with a clear field name/location) or remove the Confidence section so the instructions are self‑consistent.
```suggestion
- Retrieve and summarize additional context from vulnerability references:
- Use github mcp and web search tools to resolve reference URLs.
- Always use kernel_cve tool if the component is the Linux kernel.
- If cisa_kev_tool is available, check for known exploits.
- Confidence:
- Calibrate confidence to the fraction of base metrics you are ≥80% sure about (e.g., 0.75 if 6/8 are certain).
- Output
- Provide the CVSS 3.1 base vector string, the numeric base score, and a numeric confidence value between 0 and 1 named `confidence`, calibrated as described above.
- Provide the vector and score first, then impact, then a concise explanation with metric-by-metric rationale.
- Keep explanations concise.
""",
context=CVEFeatureInput(cve_id=cve_id),
static_context=remove_keys(
)
```
</issue_to_address>
### Comment 2
<location> `evals/features/common.py:153-160` </location>
<code_context>
return EvaluationReason(value=value, reason=(fail_reason if not value else None))
+def eval_name_from_result(result):
+ """return human-readable evaluator name associated with the evaluation result"""
+ try:
+ # This works for our custom evaluators
+ return result.name
+ except AttributeError:
+ # This works for a scoring LLMJudge
+ return result.source.arguments["score"]["evaluation_name"]
+
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Assumed structure of `result.source.arguments` may raise runtime KeyErrors.
In the fallback branch, this assumes `result.source.arguments["score"]["evaluation_name"]` is always present. If an evaluator changes its argument shape or omits these keys, this will raise a `KeyError` instead of degrading gracefully. Consider using safe access (e.g. `get`/membership checks) and falling back to a generic name (or `str(result.source)`) when those keys are missing.
</issue_to_address>
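A hedged sketch of the safe-access variant the comment suggests; the attribute and key names mirror the snippet above, while the generic fallback label is an assumption:

```python
def eval_name_from_result(result):
    """return human-readable evaluator name, degrading gracefully if keys are missing"""
    # custom evaluators expose the name directly
    name = getattr(result, "name", None)
    if name:
        return name
    # a scoring LLMJudge keeps the name in its call arguments
    arguments = getattr(getattr(result, "source", None), "arguments", None) or {}
    score = arguments.get("score") if isinstance(arguments, dict) else None
    if isinstance(score, dict) and "evaluation_name" in score:
        return score["evaluation_name"]
    # fall back to something stable instead of raising KeyError/TypeError
    return str(getattr(result, "source", "unknown-evaluator"))
```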
### Comment 3
<location> `evals/features/common.py:193-201` </location>
<code_context>
+ for ecase in report.failures:
+ failures += f"{ecase.name}: case failure: {ecase.error_message}\n"
# iterate through evaluated cases
- for case in report.cases:
+ for ecase in report.cases:
# bool assertions
- for result in case.assertions.values():
+ for result in ecase.assertions.values():
if result.value is False:
- failures += f"{case.name}: {result.source}: {result.reason}\n"
+ failures += f"{ecase.name}: {result.name}: False"
+ if result.reason:
+ failures += f", reason: {result.reason}"
+ failures += "\n"
# evaluator failures
</code_context>
<issue_to_address>
**suggestion:** Failure messages for boolean assertions changed and no longer surface evaluator identity consistently.
The previous format logged `{case.name}: {result.source}: {result.reason}`, which let us easily tie failures to specific evaluators/checks. The new format uses `{ecase.name}: {result.name}: False[, reason: ...]` and drops `result.source`, so identical assertion names across evaluators become harder to disambiguate. Please include a stable evaluator identifier (e.g., `result.source` or a derived `eval_name_from_result`) alongside `result.name` to keep these failure logs unambiguous.
```suggestion
# iterate through evaluated cases
for ecase in report.cases:
# bool assertions
for result in ecase.assertions.values():
if result.value is False:
# include evaluator identity (result.source) to keep logs unambiguous
failures += f"{ecase.name}: {result.source}: {result.name}: False"
if result.reason:
failures += f", reason: {result.reason}"
failures += "\n"
```
</issue_to_address>
Force-pushed from 7668364 to 9a59e17
... so that we use supported versions of the modules we depend on. Related: https://issues.redhat.com/browse/AEGIS-265
... to eliminate the following OTEL-related warnings:
```
<frozen abc>:106
<frozen abc>:106
<frozen abc>:106: DeprecationWarning: You should use `Logger` instead. Deprecated since version 1.39.0 and will be removed in a future release.
<frozen abc>:106
<frozen abc>:106
<frozen abc>:106: DeprecationWarning: You should use `LoggerProvider` instead. Deprecated since version 1.39.0 and will be removed in a future release.
.venv/lib64/python3.13/site-packages/opentelemetry/_events/__init__.py:201
/home/kdudka/git/aegis/.venv/lib64/python3.13/site-packages/opentelemetry/_events/__init__.py:201: DeprecationWarning: You should use `ProxyLoggerProvider` instead. Deprecated since version 1.39.0 and will be removed in a future release.
_PROXY_EVENT_LOGGER_PROVIDER = ProxyEventLoggerProvider()
```
Related: https://issues.redhat.com/browse/AEGIS-265
It is a soft keyword in Python 3.10+ used in the match/case statement. Related: https://issues.redhat.com/browse/AEGIS-265
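As an illustration (not from the PR; the values are made up), `case` stays a legal identifier even though `match` statements treat it specially, which is why a loop variable named `case` reads ambiguously next to them:

```python
def describe(value):
    match value:
        case int():
            return "integer"
        case _:
            return "other"

# legal, but easy to confuse with the `case` clauses above; hence the rename to `ecase`
for case in ["a", 1]:
    print(describe(case))
```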
... when a boolean assertion fails. Related: https://issues.redhat.com/browse/AEGIS-265
... when an evaluator assigns a score below the threshold. Related: https://issues.redhat.com/browse/AEGIS-265
Force-pushed from 64dcb5f to 9f96090
@sourcery-ai review
Hey there - I've reviewed your changes - here's some feedback:
- In `evals/features/common.py`, `eval_name_from_result` assumes the LLM judge result structure always has `source.arguments['score']['evaluation_name']`; consider using `.get` lookups (and a sensible fallback) to avoid KeyErrors if the underlying result schema changes.
- In `cve_exclude_fields`, it may be worth normalizing `exclude_fields` to an empty list when `None` and preventing exclusion of required fields like `cve_id`, so that unexpected input or future feature_deps values don't cause `CVE` instantiation errors.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `evals/features/common.py`, `eval_name_from_result` assumes the LLM judge result structure always has `source.arguments['score']['evaluation_name']`; consider using `.get` lookups (and a sensible fallback) to avoid KeyErrors if the underlying result schema changes.
- In `cve_exclude_fields`, it may be worth normalizing `exclude_fields` to an empty list when `None` and preventing exclusion of required fields like `cve_id`, so that unexpected input or future feature_deps values don't cause `CVE` instantiation errors.
## Individual Comments
### Comment 1
<location> `src/aegis_ai/toolsets/tools/osidb/__init__.py:93-102` </location>
<code_context>
+ )
+
+
+def cve_exclude_fields(cve: CVE, exclude_fields: List[str]):
+ """return a CVE object with data removed in fields specified by exclude_fields"""
+ # "cve_description" is used in OSIM, "description" is used in OSIDB
+ fields_to_exclude = set(
+ [field.replace("cve_description", "description") for field in exclude_fields]
+ )
+
+ # create a local copy so that we can change the CVE object
+ cve = cve.model_copy()
+ if "rh_cvss_score" in fields_to_exclude:
+ # exclude RH-provided CVSS
+ cve.cvss_scores = [cvss for cvss in cve.cvss_scores if cvss["issuer"] != "RH"]
+
+ # finally remove all fields listed in fields_to_exclude
+ filtered_dump = cve.model_dump(exclude=fields_to_exclude)
+ return CVE(**filtered_dump)
</code_context>
<issue_to_address>
**issue (bug_risk):** Handle a possible None value for exclude_fields in cve_exclude_fields to avoid runtime errors.
Since `flaw_tool` and `osidb_tool` now pass `ctx.deps.exclude_osidb_fields` directly to `cve_exclude_fields`, a `None` (or other non-list falsy value) will cause a `TypeError` when iterating or building the set. Please normalize this parameter inside `cve_exclude_fields` (e.g., `exclude_fields = exclude_fields or []`) or type it as `Optional[List[str]]` and handle `None` explicitly.
</issue_to_address>
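A minimal sketch of the suggested normalization, factored out for clarity; the helper name `normalize_exclude_fields` and the `REQUIRED_FIELDS` set are hypothetical, while the `cve_description`/`description` mapping comes from the snippet above:

```python
from typing import List, Optional, Set

REQUIRED_FIELDS = {"cve_id"}  # assumption: fields the CVE model cannot lose

def normalize_exclude_fields(exclude_fields: Optional[List[str]]) -> Set[str]:
    """Map OSIM field names to OSIDB ones, tolerating None and protecting required fields."""
    # None (or any other falsy value) simply means "exclude nothing"
    exclude_fields = exclude_fields or []
    return {
        # "cve_description" is used in OSIM, "description" is used in OSIDB
        field.replace("cve_description", "description")
        for field in exclude_fields
        if field not in REQUIRED_FIELDS
    }
```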
### Comment 2
<location> `evals/features/cve/test_suggest_impact.py:36-45` </location>
<code_context>
# TODO: check whether the cvss Python module could anyhow help with this
-def score_cvss3_diff(cvss3: str, cvss3_exp: str) -> float:
- """check how a pair of CVSS 3.1 vectors differ from each other. 0.0 means completely different, 1.0 means exact match"""
+def score_cvss3_diff(cvss3: str, cvss3_exp: str) -> tuple[float, str | None]:
+ """Compare two CVSS 3.1 vectors and return (score, reason).
+ 0.0 means completely different, 1.0 means exact match.
</code_context>
<issue_to_address>
**suggestion (testing):** Consider adding direct unit tests for `score_cvss3_diff` to cover parsing and the new reason output
This function now handles parsing, metric normalization, distance computation, and human-readable reason construction, but none of this is covered by direct tests.
Please add focused unit tests in this module to cover cases such as:
- Exact match: identical vectors return `(1.0, None)`.
- Single-metric difference: e.g. only `AV` differs, score < 1 and `reason` mentions `AV`.
- Multiple metric differences: `reason` lists all mismatched metrics and the score reflects greater distance.
- Malformed/partial vectors: missing/invalid metrics either produce predictable behavior or trigger the `CVSSVectorEvaluator` exception path with a reason like `"unhandled exception: ..."`.
This will keep `CVSSVectorEvaluator` behavior reliable as vectors and formats change.
Suggested implementation:
```python
# TODO: check whether the cvss Python module could anyhow help with this
def score_cvss3_diff(cvss3: str, cvss3_exp: str) -> tuple[float, str | None]:
    """Compare two CVSS 3.1 vectors and return (score, reason).
    0.0 means completely different, 1.0 means exact match.
    When the score is not 1.0, reason enumerates mismatched metrics."""
    if cvss3 == cvss3_exp:
        # exact match
        return (1.0, None)

    def _parse(v: str) -> dict[str, str]:
        parts = v.split("/")

    metrics = tuple(scales.keys())
    diffs = [_norm_dist(m) for m in metrics]


def test_score_cvss3_diff_exact_match() -> None:
    base_vector = "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"
    score, reason = score_cvss3_diff(base_vector, base_vector)
    assert score == 1.0
    assert reason is None


def test_score_cvss3_diff_single_metric_difference() -> None:
    base_vector = "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"
    # Only AV differs
    changed_vector = "CVSS:3.1/AV:L/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"
    score, reason = score_cvss3_diff(base_vector, changed_vector)
    assert 0.0 < score < 1.0
    assert reason is not None
    # reason should mention the differing metric
    assert "AV" in reason


def test_score_cvss3_diff_multiple_metric_differences() -> None:
    base_vector = "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"
    # Multiple metrics differ (AV and AC here)
    changed_vector = "CVSS:3.1/AV:L/AC:H/PR:N/UI:N/S:U/C:H/I:H/A:H"
    score_single, reason_single = score_cvss3_diff(
        base_vector,
        "CVSS:3.1/AV:L/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H",
    )
    score_multi, reason_multi = score_cvss3_diff(base_vector, changed_vector)
    # More differences should not yield a higher similarity score
    assert score_multi <= score_single
    assert reason_multi is not None
    assert "AV" in reason_multi
    assert "AC" in reason_multi


def test_score_cvss3_diff_malformed_vector_unhandled_exception() -> None:
    # Clearly malformed / partial vectors to exercise the exception-handling path
    malformed = "not-a-cvss-vector"
    base_vector = "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"
    score, reason = score_cvss3_diff(base_vector, malformed)
    assert reason is not None
    # The implementation is expected to wrap unexpected failures with this prefix
    assert reason.startswith("unhandled exception:")
    # Score should not falsely indicate a perfect match
    assert score <= 1.0
```
Because I only see a fragment of `score_cvss3_diff`, you should:
1. Ensure these tests are at module scope (no indentation) alongside your other `test_...` functions in `evals/features/cve/test_suggest_impact.py`.
2. Adjust the sample CVSS vectors if your implementation expects a slightly different vector prefix or format (e.g., `CVSS:3.0` vs `CVSS:3.1`), but keep the single-metric and multi-metric differences intact.
3. If the actual error-reporting format for malformed vectors differs (for example, a different prefix than `"unhandled exception:"`), update the `assert reason.startswith("unhandled exception:")` line to match the concrete behavior of `CVSSVectorEvaluator` in your codebase.
4. If your test framework or style guidelines differ (e.g., using `unittest.TestCase` instead of plain pytest functions), wrap these assertions accordingly in your preferred testing style.
</issue_to_address>
Force-pushed from 9f96090 to 16d478f
... so that its fields can be easily excluded without unnecessary code duplication. Related: https://issues.redhat.com/browse/AEGIS-265
The one-function-does-all approach did not work well with `osidb_cache`. When OSIDB data was cached for the first time (after fields exclusion was implemented), the data was stored with the fields excluded. Then, when the data was used for evaluation of another feature, the wrong fields were provided as input for the evaluation. For legacy cached data, no fields were excluded during evaluation at all. It turned out that certain evaluation cases had been passing only by mistake. After fixing this bug, they had to be commented out to keep the CI green. Fixes: commit 4718c58 Related: https://issues.redhat.com/browse/AEGIS-265
It turned out that evals had been passing only because `osidb_cache` was feeding the LLM with unfiltered data from OSIDB, which also contained the fields to be suggested. Related: https://issues.redhat.com/browse/AEGIS-265
It turned out that evals had been passing only because `osidb_cache` was feeding the LLM with unfiltered data from OSIDB, which also contained the fields to be suggested. Fixes: commit 2b9fa21 Related: https://issues.redhat.com/browse/AEGIS-265
It turned out that evals had been passing only because `osidb_cache` was feeding the LLM with unfiltered data from OSIDB, which also contained the fields to be suggested. Fixes: commit 2b9fa21 Related: https://issues.redhat.com/browse/AEGIS-265
... that are known to fail. Evaluate them, count their score in the overall average but do not fail if their score is below the specified threshold. This way we can monitor improvements (or regressions) even on the cases that Aegis does not handle perfectly for now. Related: https://issues.redhat.com/browse/AEGIS-265
... so that it is easier to annotate individual evaluation cases. No changes in behavior intended with this commit. Related: https://issues.redhat.com/browse/AEGIS-265
Force-pushed from 16d478f to 7e9bd01
Evaluate them, count their score in the overall average but do not fail if their score is below the specified threshold. This makes unexpected failures easier to spot in the CI logs. Related: https://issues.redhat.com/browse/AEGIS-265
... so that the evaluation cases definitions are easier to maintain. No changes in behavior intended with this commit. Related: https://issues.redhat.com/browse/AEGIS-265
Evaluate them, count their score in the overall average but do not fail if their score is below the specified threshold. This makes unexpected failures easier to spot in the CI logs. Related: https://issues.redhat.com/browse/AEGIS-265
... of the `suggest-description-for-CVE-2025-65073` evaluation case:
```
FAILED evals/features/cve/test_suggest_description.py::test_eval_suggest_description - AssertionError: Unsatisfied assertion(s): suggest-description-for-CVE-2025-65073: NoVersionInfo: False, reason: The 'suggested_description' contains '/v3/' in the API endpoints '/v3/ec2tokens' and '/v3/s3tokens', which represents a version and is not part of an acronym or its explanation.
```
This is obviously a problem with `LLMJudge` rather than a problem with the suggestion provided by Aegis. Related: https://issues.redhat.com/browse/AEGIS-265
Now as we annotate known-to-fail evaluators for each evaluation case individually, we do not need to ignore overall failures on the Aegis feature level any more. Related: https://issues.redhat.com/browse/AEGIS-265
Force-pushed from 7e9bd01 to d55e99d
What
- `uv.lock` update
- `logfire` pinned dependency to eliminate OTEL-related warnings
- `SuggestCweEvaluator` score
- `CVSSVectorEvaluator` score
- `suggest-impact` LLM prompt template
- `suggest-statement` LLM prompt template
- `StatementEvaluator` rubric
- `AEGIS_EVALS_MIN_PASSED` is not needed any more
- It turned out that certain evaluation cases had been passing only by mistake. After making the field exclusion work as expected in the evaluation suite, we had to apply a couple of fixes (and annotate some evaluation cases as known to fail) to make the CI green again.
Why
The one-function-does-all approach did not work well with `osidb_cache`. When OSIDB data was cached for the first time (after fields exclusion was implemented), the data was stored with the fields excluded. Then, when the data was used for evaluation of another feature, the wrong fields were provided as input for the evaluation. For legacy cached data, no fields were excluded during evaluation at all.
How to Test
See the CI results.
Related Tickets
Fixes: commit 4718c58
Fixes: commit 2b9fa21
Related: https://issues.redhat.com/browse/AEGIS-265
Summary by Sourcery
Decouple OSIDB CVE field exclusion from data retrieval and update CVE-related LLM prompts and evaluation harnesses to produce clearer outputs and more reliable, annotated eval results.
Bug Fixes:
Enhancements:
Build:
CI:
Tests: