Add OSV recidivism enrichment pipeline, repository mirror tooling, and INI-based local configuration #1

Merged
andymeneely merged 7 commits into master from copilot/add-osv-data-dump-scripts
May 12, 2026
Contributor

Copilot AI commented May 12, 2026

This PR adds scripts for OSV-based research workflows: enriching vulnerability records with a recidivism-derived severity metric and materializing local clones of referenced source repositories. It also introduces shared parsing/scoring utilities, INI-driven local configuration, and focused unit coverage for both metric and configuration primitives.

  • OSV ingestion + recidivism enrichment

    • Added scripts/enrich_osv_recidivism.py to:
      • download/extract OSV dump archives,
      • compute recidivism context per vulnerability (CWE recurrence, repo recurrence, fix commits),
      • write enriched JSONL output with database_specific.recidivism,
      • append normalized severity entries: RECIDIVISM and RECIDIVISM_ADJUSTED.
    • Handles overwrite/dedup behavior explicitly for pre-existing recidivism fields/severity entries.
  • Shared vulnerability analysis primitives

    • Added scripts/osv_common.py with reusable functions for:
      • CWE extraction,
      • GitHub repo normalization from references,
      • fix-commit extraction from OSV ranges and commit URLs,
      • base severity parsing (excluding recidivism-derived synthetic entries),
      • bounded adjusted severity computation (0.0..10.0),
      • aggregate history collection used by recidivism scoring.
  • Repository cloning for cluster-local mirrors

    • Added scripts/clone_osv_repositories.py to discover GitHub repos in OSV references and clone/update them locally.
    • Clone layout is namespaced as <target-dir>/<owner>/<repo> to avoid cross-owner name collisions.
    • Pull/clone failures are surfaced with contextual warnings rather than silent skips.
  • INI-based local configuration

    • Added tracked defaults in recidivism.default.ini.
    • Added scripts/recidivism_config.py to load settings from recidivism.ini with fallback to defaults.
    • Updated .gitignore to ignore recidivism.ini.
    • Both scripts now read paths/options from config, with CLI arguments available as overrides.
    • When recidivism.ini is missing, scripts print guidance to copy and edit recidivism.default.ini.
  • Documentation + focused tests

    • Updated README.md with config setup and script usage.
    • Added tests/test_osv_common.py covering extraction behavior and recidivism score math edge cases.
    • Added tests/test_recidivism_config.py covering fallback behavior, path resolution, and required-value validation.
cp recidivism.default.ini recidivism.ini

python scripts/enrich_osv_recidivism.py \
  --output data/osv_recidivism.jsonl

python scripts/clone_osv_repositories.py \
  --osv-dir data/osv_dump \
  --target-dir data/repos \
  --update-existing
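
The enrichment step described above (writing `database_specific.recidivism`, appending `RECIDIVISM` and `RECIDIVISM_ADJUSTED` severity entries, and replacing stale ones) could be sketched roughly as follows. This is a minimal illustration, not the code in `scripts/enrich_osv_recidivism.py`: the function names `clamp_score` and `enrich_record`, the `{"score": ...}` shape of the recidivism object, and the one-decimal string format are all assumptions. The recidivism value is supplied by the caller; the sketch does not define how it is computed.

```python
import json

# Synthetic severity types named in the PR description.
RECIDIVISM_TYPES = {"RECIDIVISM", "RECIDIVISM_ADJUSTED"}


def clamp_score(value, low=0.0, high=10.0):
    """Bound a score to the 0.0..10.0 range."""
    return max(low, min(high, value))


def enrich_record(record, recidivism, adjusted):
    """Return a copy of an OSV record with recidivism fields attached.

    Hypothetical helper: the field layout is assumed for illustration.
    """
    enriched = dict(record)

    # Overwrite any earlier recidivism block under database_specific.
    db_specific = dict(enriched.get("database_specific") or {})
    db_specific["recidivism"] = {"score": recidivism}
    enriched["database_specific"] = db_specific

    # Drop stale synthetic entries so that re-running does not duplicate them.
    kept = [
        entry
        for entry in enriched.get("severity") or []
        if entry.get("type") not in RECIDIVISM_TYPES
    ]
    kept.append({"type": "RECIDIVISM", "score": f"{recidivism:.1f}"})
    kept.append(
        {"type": "RECIDIVISM_ADJUSTED", "score": f"{clamp_score(adjusted):.1f}"}
    )
    enriched["severity"] = kept
    return enriched


if __name__ == "__main__":
    sample = {"id": "EXAMPLE-0001", "severity": []}
    # One JSON object per line, as in a JSONL output file.
    print(json.dumps(enrich_record(sample, 2.0, 7.5), sort_keys=True))
```

Filtering before appending keeps the operation idempotent, so enriching an already enriched dump yields the same two synthetic entries rather than four.
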
Original prompt

I want to write some scripts that will download the OSV data dump, iterate through the vulnerabilities, and add a custom metric called "recidivism" to the severity score. We'll be writing scripts that look through the vulnerabilities to find CWE information and git fix commit information to calculate recidivism from Brandon Keller's dissertation. I also want a script that will clone all the repositories referenced by the OSV locally so we can keep local clones on our research cluster.
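
As a rough illustration of the cloning half of this request, the sketch below pulls GitHub repositories out of an OSV record's `references` list and maps each one to a local directory of the form `<target-dir>/<owner>/<repo>`. The helper names (`normalize_github_repo`, `repos_in_record`, `clone_path`) are hypothetical, and lowercasing owner and repository names to deduplicate them is an assumption of this sketch rather than something the request specifies.

```python
import re
from pathlib import Path

# Matches the owner and repository segments of a github.com URL.
_GITHUB_REPO = re.compile(
    r"^https?://github\.com/([^/\s]+)/([^/\s#?]+)", re.IGNORECASE
)


def normalize_github_repo(url):
    """Return (owner, repo) for a GitHub URL, or None if it is not one."""
    match = _GITHUB_REPO.match(url.strip())
    if not match:
        return None
    owner, repo = match.group(1), match.group(2)
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    # Assumption: treat names case-insensitively to avoid duplicate clones.
    return owner.lower(), repo.lower()


def repos_in_record(record):
    """Collect the distinct GitHub repositories referenced by one OSV record."""
    found = set()
    for reference in record.get("references") or []:
        slug = normalize_github_repo(reference.get("url", ""))
        if slug:
            found.add(slug)
    return found


def clone_path(target_dir, slug):
    """Namespace clones by owner so same-named repos do not collide."""
    owner, repo = slug
    return Path(target_dir) / owner / repo
```

Commit, issue, and advisory URLs under the same repository all normalize to one `(owner, repo)` pair, so each repository would be cloned once regardless of how many records reference it.
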

Copilot AI changed the title from "[WIP] Add scripts to download OSV data and calculate recidivism" to "Add OSV recidivism enrichment pipeline and repository mirror tooling" May 12, 2026
Copilot AI requested a review from andymeneely May 12, 2026 15:17
@andymeneely
Contributor

@copilot Have the scripts get their settings from a file called "recidivism.ini" that will have any local locations of inputs and outputs. Git should ignore recidivism.ini but we should have a recidivism.default.ini that has everything we need to configure. If you run the code without a recidivism.ini, output a message about what they should do to configure this locally
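
One way the requested behavior could look, using the standard-library `configparser`: read the tracked defaults first, layer the untracked local file over them, and print setup guidance when the local file is absent. This is a sketch under assumptions, not the contents of `scripts/recidivism_config.py`; the function name `load_config`, the `base_dir` parameter, and the exact wording of the message are hypothetical.

```python
import configparser
import sys
from pathlib import Path

DEFAULT_NAME = "recidivism.default.ini"
LOCAL_NAME = "recidivism.ini"


def load_config(base_dir="."):
    """Load defaults, then overlay recidivism.ini when it exists."""
    base = Path(base_dir)
    default_path = base / DEFAULT_NAME
    local_path = base / LOCAL_NAME

    if not local_path.is_file():
        # Tell the user how to set up a local configuration.
        print(
            f"{LOCAL_NAME} not found. Copy {DEFAULT_NAME} to {LOCAL_NAME} "
            "and edit the input and output paths for this machine.",
            file=sys.stderr,
        )

    parser = configparser.ConfigParser()
    # Later files override earlier ones; missing files are skipped silently.
    parser.read([default_path, local_path])
    return parser
```

Because `ConfigParser.read` applies files in order, any key left out of `recidivism.ini` keeps its value from `recidivism.default.ini`.
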

@andymeneely
Contributor

@copilot Don't actually try to calculate recidivism - instead put a placeholder there and remove the code that was guessing what vulnerability recidivism actually is.

Contributor Author

Copilot AI commented May 12, 2026

@copilot Have the scripts get their settings from a file called "recidivism.ini" that will have any local locations of inputs and outputs. Git should ignore recidivism.ini but we should have a recidiv...

Implemented in commit d146354. Both scripts now load settings from recidivism.ini, with a tracked recidivism.default.ini template and .gitignore updated to ignore recidivism.ini. If recidivism.ini is missing, the scripts print a setup message telling users to copy and edit the default config. README now documents this configuration flow.

@andymeneely andymeneely marked this pull request as ready for review May 12, 2026 15:28
Copilot AI changed the title from "Add OSV recidivism enrichment pipeline and repository mirror tooling" to "Add OSV recidivism enrichment pipeline, repository mirror tooling, and INI-based local configuration" May 12, 2026
Contributor

@andymeneely andymeneely left a comment

This is in the ballpark of what I'm looking for in terms of making our own copy of OSV data with recidivism in it. Let's start with running these and refining them.

@andymeneely andymeneely merged commit 3410ef2 into master May 12, 2026
1 check failed
@andymeneely andymeneely deleted the copilot/add-osv-data-dump-scripts branch May 12, 2026 15:30
Copilot stopped work on behalf of andymeneely due to an error May 12, 2026 15:30
Copilot AI requested a review from andymeneely May 12, 2026 15:30