Imaging-Plaza/git-metadata-extractor

Git Metadata Extractor

Turn a GitHub URL into a SHACL-validated Open Pulse Ontology graph: repositories, the people who built them, the organizations behind them, and the papers they cite.

🌐 In production at


What it does

                 ┌──────────────────────────────┐
github.com/X  →  │  /v2/extract                 │  →  JSON-LD graph
                 │   classify → gather context  │
                 │   → root + fan-out agents    │     - schema:SoftwareSourceCode
                 │   → reconcile + resolve ROR  │     - schema:Person
                 │   → SHACL validate           │     - org:Organization
                 └──────────────────────────────┘     - org:Membership
                                                      - pulse:Contribution
                                                      - schema:ScholarlyArticle

A single GitHub URL → a typed graph you can SPARQL. The pipeline combines deterministic provider lookups (GitHub REST, ORCID, ROR, Infoscience, GIMIE), nine RAG indices, and optional LLM agents.


Try it in 30 seconds

# 1. Setup
git clone https://github.com/Imaging-Plaza/git-metadata-extractor.git
cd git-metadata-extractor
just install-dev
cp .env.example .env   # fill in GME_GITHUB_TOKEN + one LLM credential

# 2. Run
just serve-dev         # starts on http://localhost:1234

# 3. Extract
curl "http://localhost:1234/v2/extract/github.com/Imaging-Plaza/git-metadata-extractor?output_format=jsonld"

You'll get a JSON-LD @graph like:

{
  "@context": "https://open-pulse.epfl.ch/ontology/v2.1.2.jsonld",
  "@graph": [
    {
      "@id": "urn:pulse:Imaging-Plaza/git-metadata-extractor",
      "@type": "schema:SoftwareSourceCode",
      "schema:name": "git-metadata-extractor",
      "schema:license": "https://spdx.org/licenses/MIT.html",
      "schema:author": [
        { "@id": "urn:pulse:caviri" }
      ],
      "pulse:githubRepositoryHandle": "Imaging-Plaza/git-metadata-extractor"
    },
    {
      "@id": "urn:pulse:caviri",
      "@type": "schema:Person",
      "schema:name": "Carlos Vivar Rios",
      "org:hasMembership": [
        { "@id": "urn:pulse:caviri__https://ror.org/02hdt9m26" }
      ]
    },
    {
      "@id": "urn:pulse:caviri__https://ror.org/02hdt9m26",
      "@type": "org:Membership",
      "org:organization": { "@id": "https://ror.org/02hdt9m26" }
    },
    {
      "@id": "https://ror.org/02hdt9m26",
      "@type": "org:Organization",
      "schema:name": "Swiss Data Science Center"
    }
  ]
}

That person → membership → organization chain was inferred by the ROR resolver stages: the pipeline reads _company: "@SwissDataScienceCenter" from GitHub, hits the ROR index, and materializes a proper org:Membership triple. See docs/v2-pipeline.md for the full stage walkthrough.
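
To make the person → membership → organization chain concrete, here is a minimal sketch that follows the `@id` references in a response like the one above using plain dictionaries. The `affiliations` helper is hypothetical (it is not part of the project), and it assumes every node carries a single string `@type` and the same property shapes as the sample; a real response may be shaped differently.

```python
import json

# Trimmed copy of the sample response above; a real response may carry more fields.
SAMPLE = json.loads("""
{
  "@graph": [
    {"@id": "urn:pulse:caviri", "@type": "schema:Person",
     "schema:name": "Carlos Vivar Rios",
     "org:hasMembership": [{"@id": "urn:pulse:caviri__https://ror.org/02hdt9m26"}]},
    {"@id": "urn:pulse:caviri__https://ror.org/02hdt9m26", "@type": "org:Membership",
     "org:organization": {"@id": "https://ror.org/02hdt9m26"}},
    {"@id": "https://ror.org/02hdt9m26", "@type": "org:Organization",
     "schema:name": "Swiss Data Science Center"}
  ]
}
""")


def affiliations(doc: dict) -> dict:
    """Map each person's name to the names of the organizations they belong to."""
    # Index every node by @id so references can be resolved with a dict lookup.
    nodes = {node["@id"]: node for node in doc["@graph"]}
    result = {}
    for node in nodes.values():
        if node.get("@type") != "schema:Person":
            continue
        orgs = []
        for ref in node.get("org:hasMembership", []):
            membership = nodes.get(ref["@id"], {})
            org_ref = membership.get("org:organization")
            if org_ref is None:
                continue  # dangling or incomplete membership node
            org = nodes.get(org_ref["@id"], {})
            # Fall back to the ROR identifier when the organization has no name.
            orgs.append(org.get("schema:name", org_ref["@id"]))
        result[node.get("schema:name", node["@id"])] = orgs
    return result


print(affiliations(SAMPLE))
# {'Carlos Vivar Rios': ['Swiss Data Science Center']}
```

For anything beyond a quick inspection, loading the JSON-LD into an RDF store and querying it with SPARQL is likely the more robust route, since it does not depend on how the document happens to be compacted.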


More examples

# Async extraction (returns a job id)
curl -X POST http://localhost:1234/v2/extract \
     -H "Content-Type: application/json" \
     -d '{"url": "https://github.com/epfl-llm/meditron-7b"}'
# {"job_id": "0193ab12-...", "status": "queued"}

curl http://localhost:1234/v2/jobs/0193ab12-...

# Extract a person profile (fans out to their owned repos)
curl "http://localhost:1234/v2/extract/github.com/caviri"

# Extract an organization
curl "http://localhost:1234/v2/extract/github.com/Imaging-Plaza"

# Surface the internal pipeline fields too (gme-internal: + publiccode: namespaces)
curl "http://localhost:1234/v2/extract/github.com/X?include_internal_fields=true"

# Switch runtime per request
curl "http://localhost:1234/v2/extract/github.com/X?agent_runtime=rule_based"
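
The async flow above (submit, then poll the job) can be scripted. The sketch below uses only the standard library. Only the `queued` status and the `job_id` field appear in the example output above; the `running` status name and the assumption that any other status is terminal are guesses, so check docs/v2-api-reference.md for the actual job lifecycle. `http_json` and `wait_for_job` are hypothetical helper names.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

BASE_URL = "http://localhost:1234"  # dev server address used in the examples above


def http_json(url: str, payload: Optional[dict] = None) -> dict:
    """POST `payload` as JSON when given, otherwise GET; return the decoded body."""
    data = json.dumps(payload).encode() if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    request = urllib.request.Request(url, data=data, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_job(
    repo_url: str,
    fetch: Callable[..., dict] = http_json,
    interval: float = 2.0,
    timeout: float = 600.0,
    pending: tuple = ("queued", "running"),  # "running" is an assumed status name
) -> dict:
    """Submit `repo_url`, then poll /v2/jobs/<id> until the status leaves `pending`."""
    job = fetch(f"{BASE_URL}/v2/extract", {"url": repo_url})
    job_id = job["job_id"]
    deadline = time.monotonic() + timeout
    while job.get("status") in pending:
        if time.monotonic() > deadline:
            raise TimeoutError(f"job {job_id} still pending after {timeout} s")
        time.sleep(interval)
        job = fetch(f"{BASE_URL}/v2/jobs/{job_id}")
    return job


# Usage, with the dev server running:
#   final = wait_for_job("https://github.com/epfl-llm/meditron-7b")
#   print(final["status"])
```

The `fetch` parameter exists so the polling loop can be exercised with a stub instead of a live server.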

Swagger UI: http://localhost:1234/docs
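
When calling the synchronous endpoint from a script, the query parameters shown in the curl examples (`output_format`, `include_internal_fields`, `agent_runtime`) can be assembled with a small helper. This is a sketch only: `extract_url` is a hypothetical name, and stripping a leading `https://` from the target is a convenience assumed here, based on the scheme-less paths used in the examples.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:1234"  # dev server address used in the examples above


def extract_url(target: str, **params) -> str:
    """Build a synchronous /v2/extract URL for `target`, e.g. "github.com/caviri"."""
    # The curl examples pass the target without a scheme, so drop one if present.
    target = target.removeprefix("https://").removeprefix("http://").strip("/")
    # Booleans are lower-cased to match the `include_internal_fields=true` style.
    query = {
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in params.items()
        if value is not None
    }
    url = f"{BASE_URL}/v2/extract/{target}"
    return f"{url}?{urlencode(query)}" if query else url


print(extract_url("github.com/X", include_internal_fields=True, agent_runtime="rule_based"))
# http://localhost:1234/v2/extract/github.com/X?include_internal_fields=true&agent_runtime=rule_based
```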


Documentation

| Doc | What's in it |
| --- | --- |
| docs/v2-pipeline.md | Pipeline overview, load-bearing assumptions, affiliation strategy, env flags. Start here. |
| docs/getting-started.md | Install + first run, the long version |
| docs/v2-api-reference.md | /v2/extract, /v2/jobs, /v2/graph endpoints |
| docs/rag-indices.md | Nine RAG indices + federated layer |
| docs/v2-rag-tools.md | Agent-side RAG tools wired into the pipeline |
| docs/migration-v1-to-v2.md | /v1 → /v2 endpoint mapping |
| .env.example | Every env var with defaults and notes |

Versioned doc site: https://imaging-plaza.github.io/git-metadata-extractor/


Repository layout

src/v2/                  # v2 extraction pipeline (new work here)
  api.py                 # /v2/extract endpoint
  pipeline/stages/       # 25 sequential pipeline stages
  agents/llm/            # LLM-backed entity agents + RAG tools
  agents/rule_based/     # deterministic counterparts
  ingest/providers/      # GitHub, ROR, ORCID, Infoscience clients
  schema/                # JSON Schema + JSON-LD context + Pydantic models
  validation/            # strict-schema + SHACL validators

src/index/               # nine RAG indices (HuggingFace, OpenAlex, Infoscience,
                         # ORCID, ROR, Zenodo, ETHZ, GitHub, SNSF) + federated

src/v1/                  # frozen legacy pipeline, no new work
tests/v2/                # default test target
docs/                    # MkDocs site source

Configuration

Everything is in .env: copy .env.example and fill in what you need. Required minimum:

  • GME_GITHUB_TOKEN: required for any real GitHub call.
  • One LLM credential: RCP_TOKEN (EPFL), OPENAI_API_KEY, or OPENROUTER_API_KEY.

Optional (only when you use the feature): INFOSCIENCE_TOKEN, SELENIUM_REMOTE_URL, HF_TOKEN, ZENODO_TOKEN, OPENALEX_MAILTO, EPFL_GRAPH_USERNAME / EPFL_GRAPH_PASSWORD.

All ~40 env vars (with defaults + per-feature explanations) are in .env.example.
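
As a quick pre-flight check, the required minimum above can be verified before starting the server. This is a minimal sketch, not the project's own validation: `missing_settings` is a hypothetical helper, and it only inspects variables already exported into the process environment (it does not parse the .env file itself).

```python
import os
from typing import Mapping

# The three LLM credentials listed above; any one of them is enough.
LLM_CREDENTIALS = ("RCP_TOKEN", "OPENAI_API_KEY", "OPENROUTER_API_KEY")


def missing_settings(env: Mapping[str, str] = os.environ) -> list:
    """Return human-readable problems; an empty list means the minimum is met."""
    problems = []
    if not env.get("GME_GITHUB_TOKEN"):
        problems.append("GME_GITHUB_TOKEN is not set")
    if not any(env.get(name) for name in LLM_CREDENTIALS):
        problems.append(
            "no LLM credential set (one of: " + ", ".join(LLM_CREDENTIALS) + ")"
        )
    return problems


for problem in missing_settings():
    print("config:", problem)
```

Passing a plain dict instead of `os.environ` makes the check easy to try without touching the real environment.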


Testing

just test              # fast loop via testmon (reruns only the tests affected by your changes)
just test-full         # full deterministic run
just lint              # ruff
just type-check        # mypy
just ci                # lint + type-check + coverage

Per-index suites: just hf-test, just orcid-test, just openalex-test, etc.


Docker

docker build -t git-metadata-extractor -f tools/image/Dockerfile .
docker run -it --rm --env-file .env -p 1234:1234 \
    -v ./data:/app/data --name gme --network dev \
    git-metadata-extractor

Selenium (for the link-veracity stage) and Qdrant (for the RAG indices) wire up through the devcontainer compose file. See docs/getting-started.md for the full setup.


Credits

  • Quentin Chappuis, EPFL Center for Imaging
  • Robin Franken, SDSC
  • Carlos Vivar Rios, SDSC / EPFL Center for Imaging

Built at SDSC and the EPFL Center for Imaging.