Turn a GitHub URL into a SHACL-validated Open Pulse Ontology graph: repositories, the people who built them, the organizations behind them, and the papers they cite.
In production at:
- imagingplaza.epfl.ch: discovery portal for EPFL imaging software.
- openpulse.science: broader EPFL/Swiss open-science software graph.
                 +------------------------------+
github.com/X --> | /v2/extract                  | --> JSON-LD graph
                 | classify -> gather context   |
                 | -> root + fan-out agents     |     - schema:SoftwareSourceCode
                 | -> reconcile + resolve ROR   |     - schema:Person
                 | -> SHACL validate            |     - org:Organization
                 +------------------------------+     - org:Membership
                                                      - pulse:Contribution
                                                      - schema:ScholarlyArticle
A single GitHub URL becomes a typed graph you can query with SPARQL. The pipeline combines deterministic provider lookups (GitHub REST, ORCID, ROR, Infoscience, GIMIE), nine RAG indices, and optional LLM agents.
# 1. Setup
git clone https://github.com/Imaging-Plaza/git-metadata-extractor.git
cd git-metadata-extractor
just install-dev
cp .env.example .env # fill in GME_GITHUB_TOKEN + one LLM credential
# 2. Run
just serve-dev # starts on http://localhost:1234
# 3. Extract
curl "http://localhost:1234/v2/extract/github.com/Imaging-Plaza/git-metadata-extractor?output_format=jsonld"
You'll get a JSON-LD @graph like:
{
"@context": "https://open-pulse.epfl.ch/ontology/v2.1.2.jsonld",
"@graph": [
{
"@id": "urn:pulse:Imaging-Plaza/git-metadata-extractor",
"@type": "schema:SoftwareSourceCode",
"schema:name": "git-metadata-extractor",
"schema:license": "https://spdx.org/licenses/MIT.html",
"schema:author": [
{ "@id": "urn:pulse:caviri" }
],
"pulse:githubRepositoryHandle": "Imaging-Plaza/git-metadata-extractor"
},
{
"@id": "urn:pulse:caviri",
"@type": "schema:Person",
"schema:name": "Carlos Vivar Rios",
"org:hasMembership": [
{ "@id": "urn:pulse:caviri__https://ror.org/02hdt9m26" }
]
},
{
"@id": "urn:pulse:caviri__https://ror.org/02hdt9m26",
"@type": "org:Membership",
"org:organization": { "@id": "https://ror.org/02hdt9m26" }
},
{
"@id": "https://ror.org/02hdt9m26",
"@type": "org:Organization",
"schema:name": "Swiss Data Science Center"
}
]
}
That person -> membership -> organization chain was inferred by the ROR resolver stages: the pipeline reads _company: "@SwissDataScienceCenter" from GitHub, looks it up in the ROR index, and materializes a proper org:Membership triple. See docs/v2-pipeline.md for the full stage walkthrough.
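To make the chain concrete, here is a minimal Python sketch that walks person -> membership -> organization in a response shaped like the sample above. It uses only the standard library and a trimmed copy of that sample; the helper names (`index_by_id`, `as_list`, `organizations_of`) are illustrative and not part of the project.

```python
# Trimmed copy of the sample response above (context and repository node dropped).
SAMPLE = {
    "@graph": [
        {
            "@id": "urn:pulse:caviri",
            "@type": "schema:Person",
            "schema:name": "Carlos Vivar Rios",
            "org:hasMembership": [
                {"@id": "urn:pulse:caviri__https://ror.org/02hdt9m26"}
            ],
        },
        {
            "@id": "urn:pulse:caviri__https://ror.org/02hdt9m26",
            "@type": "org:Membership",
            "org:organization": {"@id": "https://ror.org/02hdt9m26"},
        },
        {
            "@id": "https://ror.org/02hdt9m26",
            "@type": "org:Organization",
            "schema:name": "Swiss Data Science Center",
        },
    ]
}


def index_by_id(doc):
    """Map every node's @id to the node so references can be followed."""
    return {node["@id"]: node for node in doc.get("@graph", []) if "@id" in node}


def as_list(value):
    """JSON-LD values may be a single object or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def organizations_of(doc, person_id):
    """Follow person -> org:hasMembership -> org:organization and return names."""
    nodes = index_by_id(doc)
    person = nodes.get(person_id, {})
    names = []
    for membership_ref in as_list(person.get("org:hasMembership")):
        membership = nodes.get(membership_ref["@id"], {})
        for org_ref in as_list(membership.get("org:organization")):
            org = nodes.get(org_ref["@id"], {})
            # Fall back to the ROR URL when the organization node has no name.
            names.append(org.get("schema:name", org_ref["@id"]))
    return names


print(organizations_of(SAMPLE, "urn:pulse:caviri"))
# ['Swiss Data Science Center']
```

For anything beyond a quick lookup, loading the JSON-LD into an RDF library and querying it with SPARQL is likely the better fit; the dictionary walk only illustrates how the three node types reference each other.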
# Async extraction (returns a job id)
curl -X POST http://localhost:1234/v2/extract \
-H "Content-Type: application/json" \
-d '{"url": "https://github.com/epfl-llm/meditron-7b"}'
# {"job_id": "0193ab12-...", "status": "queued"}
curl http://localhost:1234/v2/jobs/0193ab12-...
# Extract a person profile (fans out to their owned repos)
curl "http://localhost:1234/v2/extract/github.com/caviri"
# Extract an organization
curl "http://localhost:1234/v2/extract/github.com/Imaging-Plaza"
# Surface the internal pipeline fields too (gme-internal: + publiccode: namespaces)
curl "http://localhost:1234/v2/extract/github.com/X?include_internal_fields=true"
# Switch runtime per request
curl "http://localhost:1234/v2/extract/github.com/X?agent_runtime=rule_based"
Swagger UI: http://localhost:1234/docs
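The same calls can be scripted. The sketch below builds synchronous extract URLs and polls an async job using only the Python standard library. It assumes the local server from the quick start. The README shows only the `queued` status, so the terminal status names in `TERMINAL_STATUSES` are placeholders to be checked against docs/v2-api-reference.md, and all function names are hypothetical.

```python
import json
import time
import urllib.request
from urllib.parse import urlencode

BASE_URL = "http://localhost:1234"  # server started by `just serve-dev`

# ASSUMPTION: only "queued" appears in this README; these terminal status
# names are placeholders. Check docs/v2-api-reference.md for the real values.
TERMINAL_STATUSES = {"completed", "failed"}


def build_extract_url(target, **params):
    """Build a synchronous extract URL, e.g. target='github.com/caviri'."""
    # Render booleans as lowercase, matching include_internal_fields=true above.
    query = urlencode(
        {k: str(v).lower() if isinstance(v, bool) else v for k, v in params.items()}
    )
    url = f"{BASE_URL}/v2/extract/{target}"
    return f"{url}?{query}" if query else url


def http_get_json(url):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def submit_job(repo_url):
    """POST a repository URL to /v2/extract and return the job id."""
    req = urllib.request.Request(
        f"{BASE_URL}/v2/extract",
        data=json.dumps({"url": repo_url}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["job_id"]


def wait_for_job(job_id, fetch=http_get_json, interval=2.0, max_polls=150):
    """Poll /v2/jobs/<id> until the status is terminal or polls run out."""
    for _ in range(max_polls):
        job = fetch(f"{BASE_URL}/v2/jobs/{job_id}")
        if job.get("status") in TERMINAL_STATUSES:
            return job
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


# Usage against a running server:
#   job = wait_for_job(submit_job("https://github.com/epfl-llm/meditron-7b"))
```

Passing `fetch` as a parameter keeps the polling loop testable without a live server.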
| Doc | What's in it |
|---|---|
| docs/v2-pipeline.md | Pipeline overview, load-bearing assumptions, affiliation strategy, env flags. Start here. |
| docs/getting-started.md | Install + first run, the long version |
| docs/v2-api-reference.md | /v2/extract, /v2/jobs, /v2/graph endpoints |
| docs/rag-indices.md | Nine RAG indices + federated layer |
| docs/v2-rag-tools.md | Agent-side RAG tools wired into the pipeline |
| docs/migration-v1-to-v2.md | /v1 to /v2 endpoint mapping |
| .env.example | Every env var with defaults and notes |
Versioned doc site: https://imaging-plaza.github.io/git-metadata-extractor/
src/v2/ # v2 extraction pipeline (new work here)
api.py # /v2/extract endpoint
pipeline/stages/ # 25 sequential pipeline stages
agents/llm/ # LLM-backed entity agents + RAG tools
agents/rule_based/ # deterministic counterparts
ingest/providers/ # GitHub, ROR, ORCID, Infoscience clients
schema/ # JSON Schema + JSON-LD context + Pydantic models
validation/ # strict-schema + SHACL validators
src/index/ # nine RAG indices (HuggingFace, OpenAlex, Infoscience,
# ORCID, ROR, Zenodo, ETHZ, GitHub, SNSF) + federated
src/v1/ # frozen legacy pipeline, no new work
tests/v2/ # default test target
docs/ # MkDocs site source
Everything is in .env: copy .env.example and fill in what you need. Required minimum:
- GME_GITHUB_TOKEN: required for any real GitHub call.
- One LLM credential: RCP_TOKEN (EPFL), OPENAI_API_KEY, or OPENROUTER_API_KEY.
Optional (only when you use the feature): INFOSCIENCE_TOKEN, SELENIUM_REMOTE_URL, HF_TOKEN, ZENODO_TOKEN, OPENALEX_MAILTO, EPFL_GRAPH_USERNAME / EPFL_GRAPH_PASSWORD.
All ~40 env vars (with defaults + per-feature explanations) are in .env.example.
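A quick preflight check for the required minimum can save a failed first run. The following is a sketch, not part of the project: it assumes the variables from .env have already been exported into the process environment (for example via `docker run --env-file .env`), and the function name `missing_settings` is hypothetical.

```python
import os

# Variable names taken from the required minimum listed above.
REQUIRED = ("GME_GITHUB_TOKEN",)
LLM_CREDENTIALS = ("RCP_TOKEN", "OPENAI_API_KEY", "OPENROUTER_API_KEY")


def missing_settings(env):
    """Return a list of human-readable problems; empty means the minimum is met."""
    # A variable that is set but blank counts as missing.
    problems = [name for name in REQUIRED if not env.get(name, "").strip()]
    # At least one LLM credential must be present and non-blank.
    if not any(env.get(name, "").strip() for name in LLM_CREDENTIALS):
        problems.append("one of " + ", ".join(LLM_CREDENTIALS))
    return problems


if __name__ == "__main__":
    for problem in missing_settings(os.environ):
        print(f"missing: {problem}")
```

Because the function takes any mapping, it can be exercised with a plain dictionary in tests instead of the real environment.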
just test # fast loop via testmon (reruns only the tests affected by your changes)
just test-full # full deterministic run
just lint # ruff
just type-check # mypy
just ci # lint + type-check + coverage
Per-index suites: just hf-test, just orcid-test, just openalex-test, etc.
docker build -t git-metadata-extractor -f tools/image/Dockerfile .
docker run -it --rm --env-file .env -p 1234:1234 \
-v ./data:/app/data --name gme --network dev \
git-metadata-extractor
Selenium (for the link-veracity stage) and Qdrant (for the RAG indices) wire up through the devcontainer compose file. See docs/getting-started.md for the full setup.
- Quentin Chappuis, EPFL Center for Imaging
- Robin Franken, SDSC
- Carlos Vivar Rios, SDSC / EPFL Center for Imaging
Built at SDSC and the EPFL Center for Imaging.