API cBioPortal
The cBioPortal module provides a Python API for downloading cancer study data from cBioPortal via its public REST API (v3).
If you use cBioPortal data in your research, please cite:
Cerami E. et al. The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data. Cancer Discov. 2012. https://doi.org/10.1158/2159-8290.CD-12-0095
src/oncolearn/api/cbioportal/
├── builder.py # Builder pattern for creating cohorts from YAML
├── cbioportal_dataset.py # Dataset class for cBioPortal data
├── client.py # Thin REST client (GET/POST with retry logic)
└── download.py # Download utilities
data/configs/cbioportal/ # YAML configuration files
├── brca.yaml
└── ... (one file per configured cohort)
The builder reads YAML config files and handles all API calls internally.
from oncolearn.api.cbioportal import CBioPortalCohortBuilder
builder = CBioPortalCohortBuilder()
# Build and download a cohort
brca_cohort = builder.build_cohort("BRCA")
brca_cohort.download() # Downloads all configured datasets to data/cbioportal/TCGA-BRCA/
# Download to a custom directory
brca_cohort.download(output_dir="my_data/cbioportal/brca")

from oncolearn.api.cbioportal import CBioPortalCohortBuilder
builder = CBioPortalCohortBuilder()
cohorts = builder.list_available_cohorts()
print(cohorts) # ['BRCA', ...]

from oncolearn.api.cbioportal import CBioPortalCohortBuilder
builder = CBioPortalCohortBuilder()
brca_cohort = builder.build_cohort("BRCA")
# List datasets defined in the YAML
dataset_names = brca_cohort.list_datasets()
print(dataset_names)
# Download a single dataset
clinical = brca_cohort.get_dataset("clinical")
clinical.download("my_data/cbioportal/brca")

CBioPortalClient is a thin wrapper around the cBioPortal REST API. It handles JSON serialisation, pagination, and exponential-backoff retry on transient errors (429, 500–504).
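The retry and pagination behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the CBioPortalClient implementation: `TransientHTTPError`, `call_with_backoff`, and `fetch_all_pages` are hypothetical names, and the delay schedule and page size are assumptions.

```python
import time
from typing import Callable, List

# Status codes treated as transient, matching the list above (429, 500-504).
RETRYABLE_STATUS = {429, 500, 501, 502, 503, 504}


class TransientHTTPError(Exception):
    """Hypothetical error type that carries an HTTP status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_backoff(
    request: Callable[[], object],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> object:
    """Call request(), retrying retryable statuses with doubling delays."""
    for attempt in range(max_attempts):
        try:
            return request()
        except TransientHTTPError as err:
            last_attempt = attempt == max_attempts - 1
            if err.status not in RETRYABLE_STATUS or last_attempt:
                raise
            # Delay grows as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")


def fetch_all_pages(
    fetch_page: Callable[[int, int], List[dict]],
    page_size: int = 1000,
) -> List[dict]:
    """Request pages 0, 1, 2, ... until a short or empty page is returned."""
    results: List[dict] = []
    page = 0
    while True:
        batch = fetch_page(page, page_size)
        results.extend(batch)
        if len(batch) < page_size:
            return results
        page += 1
```

Passing `sleep` as a parameter keeps the retry loop testable without real waiting; a production client would usually also add random jitter to the delay.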
from oncolearn.api.cbioportal.client import CBioPortalClient
client = CBioPortalClient() # defaults to https://www.cbioportal.org/api

# List all studies
studies = client.list_studies()
# Search by keyword
brca_studies = client.list_studies(keyword="breast")
# Filter by cancer type
brca_studies = client.list_studies(cancer_type_id="brca")
# Get a single study
study = client.get_study("brca_tcga")

# Get all samples for a study
samples = client.get_samples("brca_tcga")
# Get just the sample IDs
sample_ids = client.get_sample_ids("brca_tcga")

# Get all patient-level clinical attributes
records = client.get_clinical_data("brca_tcga", clinical_data_type="PATIENT")
# Get a specific subset of attributes
records = client.get_clinical_data(
    "brca_tcga",
    clinical_data_type="PATIENT",
    attribute_ids=["AJCC_PATHOLOGIC_TUMOR_STAGE", "OS_STATUS"],
)
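Clinical records from the cBioPortal API come in long format, one entry per patient and attribute. The sketch below pivots such records into one row per patient. It assumes `get_clinical_data` returns a list of dicts carrying the API's `patientId`, `clinicalAttributeId`, and `value` fields; `pivot_clinical_records` is a hypothetical helper, and the records shown are toy placeholders rather than real study data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def pivot_clinical_records(
    records: Iterable[dict], id_key: str = "patientId"
) -> List[Dict[str, str]]:
    """Turn long-format records into one dict per patient, sorted by ID."""
    rows: Dict[str, Dict[str, str]] = defaultdict(dict)
    for rec in records:
        rows[rec[id_key]][rec["clinicalAttributeId"]] = rec["value"]
    return [{id_key: pid, **attrs} for pid, attrs in sorted(rows.items())]


# Toy placeholder records (assumed shape, not real data).
toy_records = [
    {"patientId": "PATIENT_2", "clinicalAttributeId": "OS_STATUS", "value": "toy_value_c"},
    {"patientId": "PATIENT_1", "clinicalAttributeId": "OS_STATUS", "value": "toy_value_a"},
    {"patientId": "PATIENT_1", "clinicalAttributeId": "AJCC_PATHOLOGIC_TUMOR_STAGE", "value": "toy_value_b"},
]

for row in pivot_clinical_records(toy_records):
    print(row)
```

For sample-level data, the same helper could be called with `id_key="sampleId"`, again assuming the records carry that field.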
# List available attributes
attrs = client.get_clinical_attributes("brca_tcga")

# List all molecular profiles for a study
profiles = client.get_molecular_profiles("brca_tcga")
# Get expression data using a sample list
expr = client.get_molecular_data(
    molecular_profile_id="brca_tcga_rna_seq_v2_mrna",
    sample_list_id="brca_tcga_rna_seq_v2_mrna",
)
# Get expression data for specific samples
expr = client.get_molecular_data(
    molecular_profile_id="brca_tcga_rna_seq_v2_mrna",
    sample_ids=["TCGA-A1-A0SB-01", "TCGA-A1-A0SD-01"],
    study_id="brca_tcga",
)

mutations = client.get_mutations(
    molecular_profile_id="brca_tcga_mutations",
    sample_list_id="brca_tcga_all",
)

# Fetches per-sample segments (iterates internally; may be slow for large cohorts)
segments = client.get_copy_number_segments("brca_tcga")
# For a subset of samples
segments = client.get_copy_number_segments("brca_tcga", sample_ids=sample_ids)

sv = client.get_structural_variants(
    molecular_profile_id="brca_tcga_sv",
    study_id="brca_tcga",
)

get_generic_assay_data covers phosphoproteomics, methylation probes, arm-level CNA, genetic ancestry, and other GENERIC_ASSAY profiles.
data = client.get_generic_assay_data(
    molecular_profile_id="brca_tcga_rppa_Zscores",
    sample_list_id="brca_tcga_all",
)

Each cohort is defined in data/configs/cbioportal/<code>.yaml:
cohort:
  name: TCGA Breast Invasive Carcinoma
  code: BRCA
  study_id: brca_tcga
  description: "TCGA BRCA cohort via cBioPortal REST API"
  default_output_subdir: TCGA-BRCA
datasets:
  - name: clinical
    description: "Patient clinical attributes"
    category: clinical
    type: clinical
    clinical_data_type: PATIENT
    attribute_ids: [] # empty = all attributes
    filename: TCGA-BRCA.clinical.tsv
  - name: mutations
    description: "Somatic mutations"
    category: mutation
    type: mutations
    molecular_profile_id: brca_tcga_mutations
    sample_list_id: brca_tcga_all
    filename: TCGA-BRCA.mutations.tsv

| type | Endpoint | Key Fields |
|---|---|---|
| clinical | /studies/{id}/clinical-data | clinical_data_type, attribute_ids |
| mutations | /molecular-profiles/{id}/mutations/fetch | molecular_profile_id, sample_list_id |
| molecular | /molecular-profiles/{id}/molecular-data/fetch | molecular_profile_id, sample_list_id |
| copy_number_segments | /studies/{id}/samples/{id}/copy-number-segments | (none extra) |
| structural_variants | /molecular-profiles/{id}/structural-variant/fetch | molecular_profile_id |
| generic_assay | /generic_assay_data/{id}/fetch | molecular_profile_id, sample_list_id |
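The key fields in the table above lend themselves to a quick check before a new config is used. The sketch below is a hypothetical helper, not part of oncolearn: `check_dataset` and `KEY_FIELDS` are illustrative names, and treating every listed key field as mandatory is an assumption (the builder may well accept defaults for some of them).

```python
from typing import Dict, List, Tuple

# Key fields per dataset type, copied from the table above.
KEY_FIELDS: Dict[str, Tuple[str, ...]] = {
    "clinical": ("clinical_data_type", "attribute_ids"),
    "mutations": ("molecular_profile_id", "sample_list_id"),
    "molecular": ("molecular_profile_id", "sample_list_id"),
    "copy_number_segments": (),
    "structural_variants": ("molecular_profile_id",),
    "generic_assay": ("molecular_profile_id", "sample_list_id"),
}


def check_dataset(entry: dict) -> List[str]:
    """Return a list of problems found in one `datasets` entry (empty if none)."""
    name = entry.get("name", "?")
    dtype = entry.get("type")
    if dtype not in KEY_FIELDS:
        return [f"{name}: unknown type {dtype!r}"]
    return [
        f"{name}: missing {field}"
        for field in KEY_FIELDS[dtype]
        if field not in entry
    ]


# One entry as it would look after loading the YAML into a dict.
entry = {
    "name": "mutations",
    "type": "mutations",
    "molecular_profile_id": "brca_tcga_mutations",
    "sample_list_id": "brca_tcga_all",
    "filename": "TCGA-BRCA.mutations.tsv",
}
print(check_dataset(entry))
```

Running the check over every item in `datasets` catches a mistyped `type` or a forgotten profile ID before any request is sent.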
- Find the study_id using the CLI or client:

      oncolearn cbioportal download --list --search breast --cancer-type brca

      studies = client.list_studies(cancer_type_id="brca")
      for s in studies:
          print(s["studyId"], s["name"])

- Create data/configs/cbioportal/<code>.yaml following the structure above.

- Run oncolearn cbioportal download --cohorts <CODE>. No Python code changes are needed.
OncoLearn | A comprehensive toolkit for cancer genomics analysis and biomarker discovery.
Built with ❤️ for cancer research