Evaluation #122
…m input parsing for readability (see readme)
…tition evaluation: response spread and consistency. based on VAPR logic
…dme updated to reflect this. All eval functions are in a new eval directory. Custom evaluator registrations are moved to a new eval_registry.py file.
ashsong-nv left a comment
Nice work! Adding some initial feedback. Let's sync on custom dataset format.
ashsong-nv left a comment
Amazing work, Katherine! The README is incredibly thorough, informative, and clear. These examples will serve as a great guide for users.
Mostly adding minor comments for refinement. I think we're good to merge after this!
Co-authored-by: Ashley Song <[email protected]>
Updated config files:

- src/vuln_analysis/configs/config-eval.yml: a separate eval-specific config that shows the `accuracy` and `consistency` custom evaluators. Additionally, a `max_concurrency` parameter was added to the cve_checklist, cve_summarize, and cve_justify pipeline stages to help avoid `429 Too Many Requests` errors when running eval with a large test set or with multiple repetitions (the modified code is in `cve_checklist.py`, `cve_justify.py`, and `cve_summarize.py` in src/vuln_analysis/functions).
- src/vuln_analysis/configs/config-tracing.yml and src/vuln_analysis/configs/config.yml: renamed the profiler dataset from `eval_dataset.json` to `profiler_dataset.json` so it is not mistaken for part of the newly added evaluation workflow.

Eval example test sets to show eval input format:
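For orientation, a test set in this shape might look roughly like the snippet below. Every field name and value here is hypothetical, invented purely for illustration — the example files listed next show the real format:

```json
{
  "containers": [
    {"name": "morpheus-23.11"},
    {"name": "morpheus-24.03"}
  ],
  "cves": [
    {"id": "CVE-2024-0001", "label": "vulnerable", "status": "affected"},
    {"id": "CVE-2024-0002", "label": "not_vulnerable", "status": "not_affected"}
  ]
}
```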
- src/vuln_analysis/data/eval_datasets/common-test-set.json: shows how multiple containers can be in one test set JSON (as long as both containers test the same set of CVEs). Note that the ground truth for the morpheus-24.03 container is dummy data.
- src/vuln_analysis/data/eval_datasets/morpheus-23.11-test-set.json: Morpheus 23.11-specific test set; ground truth is based on our hand-labeled dataset.
- src/vuln_analysis/data/eval_datasets/morpheus-24.03-test-set.json: Morpheus 24.03-specific test set; ground truth is dummy data.

Functions to handle eval input:
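As a rough sketch of the shared-CVE-set check that eval_input.py is described below as performing — the function name and input shape here are invented for illustration, not the actual code:

```python
def validate_shared_cve_set(containers: dict[str, list[str]]) -> set[str]:
    """Return the CVE ID set shared by all containers in a test set.

    containers maps a container name to the CVE IDs it is tested against
    (a hypothetical shape, not the real data model). Raises ValueError
    if any two containers list different CVE sets.
    """
    cve_sets = {name: set(cves) for name, cves in containers.items()}
    if not cve_sets:
        return set()
    unique_sets = set(map(frozenset, cve_sets.values()))
    if len(unique_sets) > 1:
        # Containers disagree on which CVEs are under test -> invalid test set.
        raise ValueError(f"Containers do not share the same CVE test set: {cve_sets}")
    return set(next(iter(unique_sets)))
```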
- src/vuln_analysis/data_models/eval_input.py: data model for the eval input. It borrows from the preexisting image input data model to validate container metadata, then verifies that all containers listed in a test set JSON share the same CVE test set.
- src/vuln_analysis/eval/parse_eval_input.py: parses the custom eval input JSON so that it is digestible by NAT's eval harness (i.e., transforms the human-readable JSON into the EvalInputItem format). NAT is told to use this parsing script via the config option eval.general.dataset.function.

Custom evaluators:
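To make the `conflict_policy` semantics described below concrete, here is a standalone sketch of how the three policies could behave. All names and data shapes are hypothetical; this is not the actual evaluator code:

```python
def resolve_ground_truth(entries, policy="lenient"):
    """Resolve duplicate/conflicting ground truth labels per CVE.

    entries: list of (cve_id, label) pairs, possibly with duplicates.
    Returns {cve_id: set_of_acceptable_labels} after applying the policy:
      - "lenient":    keep every label seen for a CVE (matching any counts as accurate).
      - "keep_first": keep only the first label seen per CVE.
      - "strict":     drop CVEs whose duplicate entries disagree (smaller test set).
    """
    truth: dict[str, set[str]] = {}
    first: dict[str, str] = {}
    for cve_id, label in entries:
        truth.setdefault(cve_id, set()).add(label)
        first.setdefault(cve_id, label)
    if policy == "lenient":
        return truth
    if policy == "keep_first":
        return {cve: {first[cve]} for cve in truth}
    if policy == "strict":
        return {cve: labels for cve, labels in truth.items() if len(labels) == 1}
    raise ValueError(f"unknown conflict_policy: {policy}")
```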
- src/vuln_analysis/eval/evaluators/accuracy.py: configurable accuracy evaluator. Config options:
  - `field`: run accuracy on the label or status field.
  - `multiple_reps`: True/False. If True, changes the eval output file to report multiple metrics used to construct a box-and-whisker plot that captures the distribution across runs instead of a single average. src/vuln_analysis/eval/visualizations/box_and_whisker_plot.py is a script that consumes this output (not part of the main nat eval run), and src/vuln_analysis/eval/visualizations/example_plot.png shows an example of this script's output using dummy data.
  - `conflict_policy`: defines how to handle CVEs in the eval test set that have duplicate/conflicting ground truth labels. Options are `lenient`, `keep_first`, and `strict`. `lenient` means that if the CVE agent picks any of the duplicate answers in the test set, the answer is marked as accurate. `keep_first` means the first ground truth encountered in the test set is used as the truth for that CVE. `strict` means all CVEs with conflicting labels are dropped (making the test set smaller).
- src/vuln_analysis/eval/evaluators/consistency.py: configurable evaluator for measuring the consistency of answers per CVE across multiple repetitions. Config options:
  - `field`: run consistency on the label or status field.
- src/vuln_analysis/register.py: added registration imports for the two custom evaluators.

Misc:
- src/vuln_analysis/utils/checklist_prompt_generator.py: while testing eval runs, I ran into parsing errors that crashed the pipeline. Making the parsing of the generated checklist more robust fixed these errors.

WIP:

- `response_similarity` evaluator (Shawn's VAPR response spread metric for the summary field)
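As one rough sketch of the per-CVE consistency idea described under the custom evaluators above (the real metric may well be defined differently), consistency can be taken as the fraction of repetitions that agree with the most common answer for each CVE:

```python
from collections import Counter

def per_cve_consistency(runs: list[dict[str, str]]) -> dict[str, float]:
    """Illustrative consistency metric, not the actual consistency.py code.

    runs: one {cve_id: answer} dict per repetition, where answer is the
    chosen field's value (e.g. label or status). Returns, per CVE, the
    fraction of repetitions matching that CVE's modal answer (1.0 means
    every repetition agreed).
    """
    answers: dict[str, list[str]] = {}
    for run in runs:
        for cve_id, answer in run.items():
            answers.setdefault(cve_id, []).append(answer)
    return {
        cve_id: Counter(values).most_common(1)[0][1] / len(values)
        for cve_id, values in answers.items()
    }
```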