API cBioPortal

andrewscouten edited this page Mar 11, 2026 · 1 revision

cBioPortal API

The cBioPortal module provides a Python API for downloading cancer study data from cBioPortal via its public REST API (v3).

Citation

If you use cBioPortal data in your research, please cite:

Cerami E. et al. The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data. Cancer Discov. 2012. https://doi.org/10.1158/2159-8290.CD-12-0095

Module Structure

src/oncolearn/api/cbioportal/
├── builder.py               # Builder pattern for creating cohorts from YAML
├── cbioportal_dataset.py    # Dataset class for cBioPortal data
├── client.py                # Thin REST client (GET/POST with retry logic)
└── download.py              # Download utilities

data/configs/cbioportal/     # YAML configuration files
├── brca.yaml
└── ... (one file per configured cohort)

High-Level API (Builder)

The builder reads YAML config files and handles all API calls internally.

Basic Usage

from oncolearn.api.cbioportal import CBioPortalCohortBuilder

builder = CBioPortalCohortBuilder()

# Build and download a cohort
brca_cohort = builder.build_cohort("BRCA")
brca_cohort.download()  # Downloads all configured datasets to data/cbioportal/TCGA-BRCA/

# Download to a custom directory
brca_cohort.download(output_dir="my_data/cbioportal/brca")

List Available Cohorts

from oncolearn.api.cbioportal import CBioPortalCohortBuilder

builder = CBioPortalCohortBuilder()
cohorts = builder.list_available_cohorts()
print(cohorts)  # ['BRCA', ...]
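
As a rough illustration of how a cohort list could be derived from the config directory, the sketch below assumes that each cohort code is simply the upper-cased stem of a YAML file. That is an assumption, not confirmed behaviour of CBioPortalCohortBuilder; `list_cohort_codes` and the `luad.yaml` demo file are hypothetical names.

```python
from pathlib import Path
import tempfile


def list_cohort_codes(config_dir):
    """Return upper-cased cohort codes, one per *.yaml file in config_dir."""
    return sorted(path.stem.upper() for path in Path(config_dir).glob("*.yaml"))


# Demonstrate against a throwaway directory instead of the real config folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("brca.yaml", "luad.yaml", "notes.txt"):
        (Path(tmp) / name).write_text("")
    demo_codes = list_cohort_codes(tmp)

print(demo_codes)  # ['BRCA', 'LUAD']
```

The real builder may instead read the `code` field inside each file, in which case file names and codes need not match.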

Download Specific Datasets

from oncolearn.api.cbioportal import CBioPortalCohortBuilder

builder = CBioPortalCohortBuilder()
brca_cohort = builder.build_cohort("BRCA")

# List datasets defined in the YAML
dataset_names = brca_cohort.list_datasets()
print(dataset_names)

# Download a single dataset
clinical = brca_cohort.get_dataset("clinical")
clinical.download("my_data/cbioportal/brca")

Low-Level API (REST Client)

CBioPortalClient is a thin wrapper around the cBioPortal REST API. It handles JSON serialisation, pagination, and retries with exponential backoff on transient HTTP errors (status 429 and 500–504).

from oncolearn.api.cbioportal.client import CBioPortalClient

client = CBioPortalClient()  # defaults to https://www.cbioportal.org/api
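
The retry behaviour described above can be pictured with a minimal sketch of exponential backoff. It is not the client's actual implementation: `call_with_backoff`, its parameters, and the delay schedule are assumptions chosen for illustration, and the transport is a fake so the example runs without network access.

```python
import time

# Status codes treated as transient, matching the list in the text above.
RETRYABLE_STATUSES = {429, 500, 501, 502, 503, 504}


def call_with_backoff(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    send is any zero-argument callable returning (status_code, body).
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE_STATUSES:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # delays double: 1s, 2s, 4s, ...
    return status, body


# Fake transport: two transient failures, then success.
responses = iter([(503, None), (429, None), (200, {"ok": True})])
delays = []  # record requested delays instead of actually sleeping
result = call_with_backoff(lambda: next(responses), sleep=delays.append)
print(result, delays)  # (200, {'ok': True}) [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the sketch testable; a production client would typically also add jitter and honour a `Retry-After` header on 429 responses.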

Studies

# List all studies
studies = client.list_studies()

# Search by keyword
brca_studies = client.list_studies(keyword="breast")

# Filter by cancer type
brca_studies = client.list_studies(cancer_type_id="brca")

# Get a single study
study = client.get_study("brca_tcga")

Samples

# Get all samples for a study
samples = client.get_samples("brca_tcga")

# Get just the sample IDs
sample_ids = client.get_sample_ids("brca_tcga")

Clinical Data

# Get all patient-level clinical attributes
records = client.get_clinical_data("brca_tcga", clinical_data_type="PATIENT")

# Get a specific subset of attributes
records = client.get_clinical_data(
    "brca_tcga",
    clinical_data_type="PATIENT",
    attribute_ids=["AJCC_PATHOLOGIC_TUMOR_STAGE", "OS_STATUS"],
)

# List available attributes
attrs = client.get_clinical_attributes("brca_tcga")
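
The cBioPortal REST API reports clinical data in long format, one record per patient and attribute. Assuming `get_clinical_data` returns those records unchanged as a list of dicts with `patientId`, `clinicalAttributeId`, and `value` keys (an assumption about this client), a per-patient view could be built roughly like this. `pivot_clinical` is a hypothetical helper and the records are hand-written samples.

```python
from collections import defaultdict


def pivot_clinical(records):
    """Turn long-format records into {patient_id: {attribute_id: value}}."""
    table = defaultdict(dict)
    for rec in records:
        table[rec["patientId"]][rec["clinicalAttributeId"]] = rec["value"]
    return dict(table)


# Hand-written sample records, not real cBioPortal output.
sample_records = [
    {"patientId": "P1", "clinicalAttributeId": "OS_STATUS", "value": "0:LIVING"},
    {"patientId": "P1", "clinicalAttributeId": "AJCC_PATHOLOGIC_TUMOR_STAGE", "value": "STAGE II"},
    {"patientId": "P2", "clinicalAttributeId": "OS_STATUS", "value": "1:DECEASED"},
]
wide = pivot_clinical(sample_records)
print(wide["P1"]["OS_STATUS"])  # 0:LIVING
```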

Molecular Data

# List all molecular profiles for a study
profiles = client.get_molecular_profiles("brca_tcga")

# Get expression data using a sample list
expr = client.get_molecular_data(
    molecular_profile_id="brca_tcga_rna_seq_v2_mrna",
    sample_list_id="brca_tcga_rna_seq_v2_mrna",
)

# Get expression data for specific samples
expr = client.get_molecular_data(
    molecular_profile_id="brca_tcga_rna_seq_v2_mrna",
    sample_ids=["TCGA-A1-A0SB-01", "TCGA-A1-A0SD-01"],
    study_id="brca_tcga",
)

Mutations

mutations = client.get_mutations(
    molecular_profile_id="brca_tcga_mutations",
    sample_list_id="brca_tcga_all",
)

Copy-Number Segments

# Fetches per-sample segments (iterates internally; may be slow for large cohorts)
segments = client.get_copy_number_segments("brca_tcga")

# For a subset of samples
segments = client.get_copy_number_segments("brca_tcga", sample_ids=sample_ids)
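
Because segments are fetched per sample, one way to keep long runs manageable is to work through the sample IDs in fixed-size batches, so progress can be saved between batches. The helper below is a generic sketch; `chunked` is a hypothetical name and the IDs are placeholders.

```python
def chunked(items, size):
    """Yield consecutive slices of items, each at most size long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder IDs; in practice these would come from client.get_sample_ids(...).
demo_ids = [f"SAMPLE-{i:02d}" for i in range(5)]
batches = list(chunked(demo_ids, 2))
print(batches)
```

Each batch could then be passed as `sample_ids=` to `get_copy_number_segments` and the results concatenated. Whether this changes total run time depends on how the client iterates internally.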

Structural Variants

sv = client.get_structural_variants(
    molecular_profile_id="brca_tcga_sv",
    study_id="brca_tcga",
)

Generic Assay Data

get_generic_assay_data covers phosphoproteomics, methylation probes, arm-level CNA, genetic ancestry, and other GENERIC_ASSAY profiles.

data = client.get_generic_assay_data(
    molecular_profile_id="brca_tcga_rppa_Zscores",
    sample_list_id="brca_tcga_all",
)

YAML Configuration Format

Each cohort is defined in data/configs/cbioportal/<code>.yaml, where <code> is the lowercase cohort code (for example, brca.yaml for BRCA):

cohort:
  name: TCGA Breast Invasive Carcinoma
  code: BRCA
  study_id: brca_tcga
  description: "TCGA BRCA cohort via cBioPortal REST API"
  default_output_subdir: TCGA-BRCA

datasets:
  - name: clinical
    description: "Patient clinical attributes"
    category: clinical
    type: clinical
    clinical_data_type: PATIENT
    attribute_ids: []           # empty = all attributes
    filename: TCGA-BRCA.clinical.tsv

  - name: mutations
    description: "Somatic mutations"
    category: mutation
    type: mutations
    molecular_profile_id: brca_tcga_mutations
    sample_list_id: brca_tcga_all
    filename: TCGA-BRCA.mutations.tsv

Dataset Type Reference

| type | Endpoint | Key Fields |
| --- | --- | --- |
| clinical | /studies/{id}/clinical-data | clinical_data_type, attribute_ids |
| mutations | /molecular-profiles/{id}/mutations/fetch | molecular_profile_id, sample_list_id |
| molecular | /molecular-profiles/{id}/molecular-data/fetch | molecular_profile_id, sample_list_id |
| copy_number_segments | /studies/{id}/samples/{id}/copy-number-segments | (none extra) |
| structural_variants | /molecular-profiles/{id}/structural-variant/fetch | molecular_profile_id |
| generic_assay | /generic_assay_data/{id}/fetch | molecular_profile_id, sample_list_id |
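
The key fields in the reference table above can double as a quick sanity check on a config before downloading. The sketch below works on dataset entries already parsed into plain dicts. Treating every key field as mandatory is an assumption (the builder may accept alternatives or defaults), and `KEY_FIELDS` and `missing_key_fields` are hypothetical names.

```python
# Key fields per dataset type, copied from the reference table.
KEY_FIELDS = {
    "clinical": ["clinical_data_type", "attribute_ids"],
    "mutations": ["molecular_profile_id", "sample_list_id"],
    "molecular": ["molecular_profile_id", "sample_list_id"],
    "copy_number_segments": [],
    "structural_variants": ["molecular_profile_id"],
    "generic_assay": ["molecular_profile_id", "sample_list_id"],
}


def missing_key_fields(dataset):
    """Return the key fields absent from one dataset entry (a plain dict)."""
    dataset_type = dataset.get("type")
    if dataset_type not in KEY_FIELDS:
        raise ValueError(f"unknown dataset type: {dataset_type!r}")
    return [field for field in KEY_FIELDS[dataset_type] if field not in dataset]


# Entries shaped like the YAML example, already parsed into dicts.
complete = {
    "name": "mutations",
    "type": "mutations",
    "molecular_profile_id": "brca_tcga_mutations",
    "sample_list_id": "brca_tcga_all",
}
incomplete = {"name": "clinical", "type": "clinical", "attribute_ids": []}

print(missing_key_fields(complete))    # []
print(missing_key_fields(incomplete))  # ['clinical_data_type']
```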

Adding New Cohorts

  1. Find the study_id using the CLI or the Python client.

     CLI:

    oncolearn cbioportal download --list --search breast --cancer-type brca

     Python:

    from oncolearn.api.cbioportal.client import CBioPortalClient

    client = CBioPortalClient()
    studies = client.list_studies(cancer_type_id="brca")
    for s in studies:
        print(s["studyId"], s["name"])

  2. Create data/configs/cbioportal/<code>.yaml following the structure above.

  3. Run oncolearn cbioportal download --cohorts <CODE>. No Python code changes are needed.
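
For step 2, a starting config can be generated from a template instead of being typed by hand. This is only a sketch: `COHORT_TEMPLATE` and `render_cohort_config` are hypothetical, the values passed in are placeholders, and the template simply mirrors the fields shown in the YAML example, with a single clinical dataset.

```python
COHORT_TEMPLATE = """\
cohort:
  name: {name}
  code: {code}
  study_id: {study_id}
  description: "{name} cohort via cBioPortal REST API"
  default_output_subdir: {subdir}

datasets:
  - name: clinical
    description: "Patient clinical attributes"
    category: clinical
    type: clinical
    clinical_data_type: PATIENT
    attribute_ids: []
    filename: {subdir}.clinical.tsv
"""


def render_cohort_config(name, code, study_id, subdir):
    """Fill the template with cohort-specific values and return YAML text."""
    return COHORT_TEMPLATE.format(
        name=name, code=code, study_id=study_id, subdir=subdir
    )


# Placeholder values; substitute the study_id found in step 1.
skeleton = render_cohort_config("Example Cohort", "DEMO", "demo_study", "DEMO")
print(skeleton)
```

The resulting text would be saved under data/configs/cbioportal/ and extended with further dataset entries as needed.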
