
@katherineh123 commented Aug 29, 2025

Updated Config files:
src/vuln_analysis/configs/config-eval.yml: separate eval-specific config that demonstrates the accuracy and consistency custom evaluators. Additionally, a max_concurrency parameter was added to the cve_checklist, cve_summarize, and cve_justify pipeline stages to help avoid 429 Too Many Requests errors when running eval with a large test set or with multiple repetitions (the modified code can be found in cve_checklist.py, cve_justify.py, and cve_summarize.py in src/vuln_analysis/functions).
src/vuln_analysis/configs/config-tracing.yml and src/vuln_analysis/configs/config.yml: renamed the profiler dataset from eval_dataset.json to profiler_dataset.json to avoid confusion with the newly added evaluation workflow.
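As a rough illustration of what the max_concurrency cap does, here is a minimal sketch of bounding in-flight LLM requests in an asyncio stage. The function and item names here are hypothetical, not the actual pipeline code:

```python
import asyncio

# Hypothetical sketch: enforcing a max_concurrency setting inside an async
# pipeline stage so that only N LLM requests are in flight at once, which
# helps avoid 429 Too Many Requests. run_stage/call_llm are illustrative names.

async def call_llm(item: str) -> str:
    await asyncio.sleep(0)          # stand-in for a rate-limited LLM request
    return f"result:{item}"

async def run_stage(items: list[str], max_concurrency: int = 4) -> list[str]:
    sem = asyncio.Semaphore(max_concurrency)   # cap in-flight requests

    async def bounded(item: str) -> str:
        async with sem:                        # at most max_concurrency at once
            return await call_llm(item)

    # gather preserves input order even though completion order may differ
    return await asyncio.gather(*(bounded(i) for i in items))

results = asyncio.run(run_stage([f"CVE-{i}" for i in range(10)], max_concurrency=3))
```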

Eval example test sets to show eval input format:
src/vuln_analysis/data/eval_datasets/common-test-set.json: shows how multiple containers can be included in one test-set JSON (as long as all containers test the same set of CVEs). Note that the ground truth for the morpheus-24.03 container is dummy data.
src/vuln_analysis/data/eval_datasets/morpheus-23.11-test-set.json: Morpheus 23.11-specific test set -- ground truth is based on our hand-labeled dataset.
src/vuln_analysis/data/eval_datasets/morpheus-24.03-test-set.json: Morpheus 24.03-specific test set -- ground truth is dummy data.
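For a rough sense of the shape described above (multiple containers sharing one CVE test set with ground-truth labels), a test set might look something like the following. All field names here are hypothetical -- see the actual files under src/vuln_analysis/data/eval_datasets for the real schema:

```json
{
  "containers": [
    {"name": "morpheus", "tag": "23.11"},
    {"name": "morpheus", "tag": "24.03"}
  ],
  "cves": [
    {"id": "CVE-2023-12345", "label": "false_positive", "status": "closed"}
  ]
}
```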

Functions to handle eval input:
src/vuln_analysis/data_models/eval_input.py: data model for eval input. It borrows from the preexisting image-input data model to validate container metadata, and additionally verifies that all containers listed in a test-set JSON share the same CVE test set.
src/vuln_analysis/eval/parse_eval_input.py: parses the custom eval input JSON so that it is digestible by NAT's eval harness (i.e., transforms the human-readable JSON into the EvalInputItem format). NAT is told to use this parsing script via the config option eval.general.dataset.function.

Custom evaluators:
src/vuln_analysis/eval/evaluators/accuracy.py: configurable accuracy evaluator. Config options:

  • field: run accuracy on label or status field
  • multiple_reps: True/False. If True, change the eval output file to emit the per-repetition metrics needed to construct a box-and-whisker plot that captures the distribution of runs, instead of a single average. src/vuln_analysis/eval/visualizations/box_and_whisker_plot.py is a script that consumes this output (not part of the main nat eval run), and src/vuln_analysis/eval/visualizations/example_plot.png shows example output of this script using dummy data.
  • conflict_policy: defines how to handle CVEs in the eval test set that have duplicate/conflicting ground-truth labels. Options are lenient, keep_first, and strict. lenient: if the CVE agent picks any of the duplicate answers in the test set, mark it as accurate. keep_first: use the first ground truth encountered in the test set as the truth for that CVE. strict: drop all CVEs that have conflicting labels (this makes the test set smaller).
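For the multiple_reps option, the per-repetition summary might be computed along these lines. Metric names are illustrative, and whether box_and_whisker_plot.py consumes exactly these fields is an assumption:

```python
import statistics

# Hedged sketch: the kind of distribution summary multiple_reps=True might
# emit per test set, so a box-and-whisker plot can show the spread of runs
# instead of a single average accuracy.

def box_stats(per_rep_accuracies: list[float]) -> dict[str, float]:
    """Five-number summary over per-repetition accuracy values."""
    q1, median, q3 = statistics.quantiles(per_rep_accuracies, n=4)
    return {
        "min": min(per_rep_accuracies),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(per_rep_accuracies),
    }
```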

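The three conflict_policy semantics for the accuracy evaluator can be sketched as below. This is an illustration of the documented behavior, not the evaluator's actual code:

```python
# Sketch of conflict_policy handling for CVEs that appear in the test set
# with conflicting ground-truth labels. rows is (cve_id, label) pairs in
# test-set order; the result maps each kept CVE to its accepted label(s).

def resolve_conflicts(rows: list[tuple[str, str]], policy: str) -> dict[str, set[str]]:
    labels: dict[str, list[str]] = {}
    for cve, label in rows:
        labels.setdefault(cve, []).append(label)

    resolved: dict[str, set[str]] = {}
    for cve, seen in labels.items():
        unique = set(seen)
        if policy == "lenient":
            resolved[cve] = unique       # any seen label counts as correct
        elif policy == "keep_first":
            resolved[cve] = {seen[0]}    # first occurrence wins
        elif policy == "strict":
            if len(unique) == 1:         # drop CVEs with conflicting labels
                resolved[cve] = unique
        else:
            raise ValueError(f"unknown conflict_policy: {policy}")
    return resolved
```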
src/vuln_analysis/eval/evaluators/consistency.py: configurable evaluator for measuring consistency of answers per-CVE across multiple repetitions. Config options:

  • field: run consistency on the label or status field

src/vuln_analysis/register.py: added registration imports for the two custom evaluators.
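One common way to measure per-CVE consistency across repetitions is modal agreement: the fraction of runs that match the most common answer. Whether consistency.py defines the metric exactly this way is an assumption:

```python
from collections import Counter

# Hedged sketch of a per-CVE consistency metric: given the answer (label or
# status) chosen for one CVE in each repetition, return the fraction of
# repetitions that agree with the most common answer. 1.0 = fully consistent.

def consistency(answers_per_rep: list[str]) -> float:
    (_, top_count), = Counter(answers_per_rep).most_common(1)
    return top_count / len(answers_per_rep)
```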

Misc:
src/vuln_analysis/utils/checklist_prompt_generator.py: while testing eval runs, I was hitting parsing errors that crashed the pipeline. Making the parsing of the generated checklist more robust fixed these errors.

WIP:

  1. Add response_similarity evaluator (Shawn's VAPR response spread metric for the summary field)
  2. Add LLM as a judge on the summary field

@ashsong-nv left a comment


Nice work! Adding some initial feedback. Let's sync on custom dataset format.

@ashsong-nv left a comment


Amazing work, Katherine! The README is incredibly thorough, informative, and clear. These examples will serve as a great guide for users.

Mostly adding minor comments for refinement. I think we're good to merge after this!

katherineh123 and others added 27 commits November 3, 2025 11:13
Co-authored-by: Ashley Song <[email protected]>
@ashsong-nv ashsong-nv merged commit 46dd4e4 into NVIDIA-AI-Blueprints:develop Nov 3, 2025