Evaluation #122
…m input parsing for readability (see readme)
…tition evaluation: response spread and consistency. based on VAPR logic
…dme updated to reflect this. All eval functions are in a new eval directory. Custom evaluator registrations are moved to a new eval_registry.py file.
ashsong-nv left a comment
Nice work! Adding some initial feedback. Let's sync on custom dataset format.
ashsong-nv left a comment
Amazing work, Katherine! The README is incredibly thorough, informative, and clear. These examples will serve as a great guide for users.
Mostly adding minor comments for refinement. I think we're good to merge after this!
Co-authored-by: Ashley Song <[email protected]>
Updated config files:

- src/vuln_analysis/configs/config-eval.yml: a separate eval-specific config that shows the `accuracy` and `consistency` custom evaluators. Additionally, a `max_concurrency` parameter was added to the cve_checklist, cve_summarize, and cve_justify pipeline stages to help avoid `429 Too Many Requests` errors when running eval with a large test set or with multiple repetitions (the modified code is in `cve_checklist.py`, `cve_justify.py`, and `cve_summarize.py` in src/vuln_analysis/functions).
- src/vuln_analysis/configs/config-tracing.yml and src/vuln_analysis/configs/config.yml: renamed the profiler dataset from `eval_dataset.json` to `profiler_dataset.json` so it is not mistaken for part of the newly added evaluation workflow.

Eval example test sets to show eval input format:
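For orientation, a test set in this shape might look roughly like the snippet below. Every field name and value here is hypothetical, invented purely for illustration — the example files listed next show the real format:

```json
{
  "containers": [
    {"name": "morpheus-23.11"},
    {"name": "morpheus-24.03"}
  ],
  "cves": [
    {"id": "CVE-2024-0001", "label": "vulnerable", "status": "affected"},
    {"id": "CVE-2024-0002", "label": "not_vulnerable", "status": "not_affected"}
  ]
}
```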
- src/vuln_analysis/data/eval_datasets/common-test-set.json: shows how multiple containers can be in one test set JSON (as long as both containers test the same set of CVEs). Note that the ground truth for the morpheus-24.03 container is dummy data.
- src/vuln_analysis/data/eval_datasets/morpheus-23.11-test-set.json: Morpheus 23.11-specific test set; ground truth is based on our hand-labeled dataset.
- src/vuln_analysis/data/eval_datasets/morpheus-24.03-test-set.json: Morpheus 24.03-specific test set; ground truth is dummy data.

Functions to handle eval input:
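As a rough sketch of the shared-CVE-set check that eval_input.py is described below as performing — the function name and input shape here are invented for illustration, not the actual code:

```python
def validate_shared_cve_set(containers: dict[str, list[str]]) -> set[str]:
    """Return the CVE ID set shared by all containers in a test set.

    containers maps a container name to the CVE IDs it is tested against
    (a hypothetical shape, not the real data model). Raises ValueError
    if any two containers list different CVE sets.
    """
    cve_sets = {name: set(cves) for name, cves in containers.items()}
    if not cve_sets:
        return set()
    unique_sets = set(map(frozenset, cve_sets.values()))
    if len(unique_sets) > 1:
        # Containers disagree on which CVEs are under test -> invalid test set.
        raise ValueError(f"Containers do not share the same CVE test set: {cve_sets}")
    return set(next(iter(unique_sets)))
```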
- src/vuln_analysis/data_models/eval_input.py: data model for the eval input. It borrows from the preexisting image input data model to validate container metadata, then verifies that all containers listed in a test set JSON share the same CVE test set.
- src/vuln_analysis/eval/parse_eval_input.py: parses the custom eval input JSON so that it is digestible by NAT's eval harness (i.e., transforms the human-readable JSON into the EvalInputItem format). NAT is told to use this parsing script via the config option eval.general.dataset.function.

Custom evaluators:
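To make the `conflict_policy` semantics described below concrete, here is a standalone sketch of how the three policies could behave. All names and data shapes are hypothetical; this is not the actual evaluator code:

```python
def resolve_ground_truth(entries, policy="lenient"):
    """Resolve duplicate/conflicting ground truth labels per CVE.

    entries: list of (cve_id, label) pairs, possibly with duplicates.
    Returns {cve_id: set_of_acceptable_labels} after applying the policy:
      - "lenient":    keep every label seen for a CVE (matching any counts as accurate).
      - "keep_first": keep only the first label seen per CVE.
      - "strict":     drop CVEs whose duplicate entries disagree (smaller test set).
    """
    truth: dict[str, set[str]] = {}
    first: dict[str, str] = {}
    for cve_id, label in entries:
        truth.setdefault(cve_id, set()).add(label)
        first.setdefault(cve_id, label)
    if policy == "lenient":
        return truth
    if policy == "keep_first":
        return {cve: {first[cve]} for cve in truth}
    if policy == "strict":
        return {cve: labels for cve, labels in truth.items() if len(labels) == 1}
    raise ValueError(f"unknown conflict_policy: {policy}")
```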
- src/vuln_analysis/eval/evaluators/accuracy.py: configurable accuracy evaluator. Config options:
  - `field`: run accuracy on the label or status field.
  - `multiple_reps`: True/False. If True, changes the eval output file to report multiple metrics used to construct a box-and-whisker plot that captures the distribution across runs instead of a single average. src/vuln_analysis/eval/visualizations/box_and_whisker_plot.py is a script that consumes this output (not part of the main nat eval run), and src/vuln_analysis/eval/visualizations/example_plot.png shows an example of this script's output using dummy data.
  - `conflict_policy`: defines how to handle CVEs in the eval test set that have duplicate/conflicting ground truth labels. Options are `lenient`, `keep_first`, and `strict`. `lenient` means that if the CVE agent picks any of the duplicate answers in the test set, the answer is marked as accurate. `keep_first` means the first ground truth encountered in the test set is used as the truth for that CVE. `strict` means all CVEs with conflicting labels are dropped (making the test set smaller).
- src/vuln_analysis/eval/evaluators/consistency.py: configurable evaluator for measuring the consistency of answers per CVE across multiple repetitions. Config options:
  - `field`: run consistency on the label or status field.
- src/vuln_analysis/register.py: added registration imports for the two custom evaluators.

Misc:
- src/vuln_analysis/utils/checklist_prompt_generator.py: while testing eval runs, I ran into parsing errors that crashed the pipeline. Making the parsing of the generated checklist more robust fixed these errors.

WIP:

- `response_similarity` evaluator (Shawn's VAPR response spread metric for the summary field)
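As one rough sketch of the per-CVE consistency idea described under the custom evaluators above (the real metric may well be defined differently), consistency can be taken as the fraction of repetitions that agree with the most common answer for each CVE:

```python
from collections import Counter

def per_cve_consistency(runs: list[dict[str, str]]) -> dict[str, float]:
    """Illustrative consistency metric, not the actual consistency.py code.

    runs: one {cve_id: answer} dict per repetition, where answer is the
    chosen field's value (e.g. label or status). Returns, per CVE, the
    fraction of repetitions matching that CVE's modal answer (1.0 means
    every repetition agreed).
    """
    answers: dict[str, list[str]] = {}
    for run in runs:
        for cve_id, answer in run.items():
            answers.setdefault(cve_id, []).append(answer)
    return {
        cve_id: Counter(values).most_common(1)[0][1] / len(values)
        for cve_id, values in answers.items()
    }
```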