cBioPortal API

The cBioPortal module provides a Python API for downloading cancer study data from cBioPortal via its public REST API (v3).

Citation

If you use cBioPortal data in your research, please cite:

Cerami E. et al. The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data. Cancer Discov. 2012. https://doi.org/10.1158/2159-8290.CD-12-0095

Module Structure

src/oncolearn/api/cbioportal/
├── builder.py               # Builder pattern for creating cohorts from YAML
├── cbioportal_dataset.py    # Dataset class for cBioPortal data
├── client.py                # Thin REST client (GET/POST with retry logic)
└── download.py              # Download utilities

data/configs/cbioportal/     # YAML configuration files
├── brca.yaml
└── ... (one file per configured cohort)

High-Level API (Builder)

The builder reads YAML config files and handles all API calls internally.

Basic Usage

from oncolearn.api.cbioportal import CBioPortalCohortBuilder

builder = CBioPortalCohortBuilder()

# Build and download a cohort
brca_cohort = builder.build_cohort("BRCA")
brca_cohort.download()  # Downloads all configured datasets to data/cbioportal/TCGA-BRCA/

# Download to a custom directory
brca_cohort.download(output_dir="my_data/cbioportal/brca")

List Available Cohorts

from oncolearn.api.cbioportal import CBioPortalCohortBuilder

builder = CBioPortalCohortBuilder()
cohorts = builder.list_available_cohorts()
print(cohorts)  # ['BRCA', ...]
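
How the builder discovers cohorts is not spelled out here. Since each cohort has one YAML file under data/configs/cbioportal/, one plausible approach is to scan that directory and derive the codes from the file names. The sketch below illustrates that idea only; `list_cohort_codes` and the `luad.yaml` file are hypothetical and are not part of the module.

```python
from pathlib import Path
import tempfile


def list_cohort_codes(config_dir):
    """Return upper-cased cohort codes, one per *.yaml file in config_dir.

    Hypothetical helper: assumes the file stem is the cohort code.
    """
    return sorted(p.stem.upper() for p in Path(config_dir).glob("*.yaml"))


# Demo against a throwaway directory instead of the real config folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("brca.yaml", "luad.yaml", "notes.txt"):
        (Path(tmp) / name).write_text("")
    codes = list_cohort_codes(tmp)

print(codes)
```

Non-YAML files such as `notes.txt` are ignored by the glob pattern.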

Download Specific Datasets

from oncolearn.api.cbioportal import CBioPortalCohortBuilder

builder = CBioPortalCohortBuilder()
brca_cohort = builder.build_cohort("BRCA")

# List datasets defined in the YAML
dataset_names = brca_cohort.list_datasets()
print(dataset_names)

# Download a single dataset
clinical = brca_cohort.get_dataset("clinical")
clinical.download("my_data/cbioportal/brca")
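
Once a dataset has been downloaded, the resulting file can be read with the standard library. The sketch below assumes the output is a tab-separated file with a single header row, which the `.tsv` filenames in the YAML config suggest but do not guarantee. The column names and values in the demo are invented for illustration.

```python
import csv
import io


def read_tsv(handle):
    """Parse a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


# In-memory stand-in for a downloaded file; columns are assumptions.
sample_text = "PATIENT_ID\tOS_STATUS\nTCGA-A1-A0SB\t0:LIVING\n"
rows = read_tsv(io.StringIO(sample_text))

print(rows[0]["OS_STATUS"])
```

For a real file, pass `open(path, newline="")` instead of the `io.StringIO` object.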

Low-Level API (REST Client)

CBioPortalClient is a thin wrapper around the cBioPortal REST API. It handles JSON serialisation, pagination, and exponential-backoff retries on transient HTTP errors (status 429 and 500–504).

from oncolearn.api.cbioportal.client import CBioPortalClient

client = CBioPortalClient()  # defaults to https://www.cbioportal.org/api
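
To make the retry behaviour concrete, here is a minimal sketch of exponential backoff on the status codes listed above. It is not the client's actual implementation: `HTTPStatusError`, `call_with_retry`, and the default retry count and delay are all assumptions chosen for illustration.

```python
import time

# Transient status codes named in the client description.
RETRYABLE = {429, 500, 501, 502, 503, 504}


class HTTPStatusError(Exception):
    """Hypothetical error type carrying the HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_retry(request, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call request(); on a retryable status wait base_delay * 2**attempt and retry."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except HTTPStatusError as err:
            # Give up on non-transient errors or when retries are exhausted.
            if err.status not in RETRYABLE or attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)


# Demo with a fake request that fails twice, then succeeds.
statuses = iter([503, 429, 200])
delays = []


def fake_request():
    status = next(statuses)
    if status != 200:
        raise HTTPStatusError(status)
    return {"ok": True}


result = call_with_retry(fake_request, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the demo instant and makes the delay schedule easy to inspect.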

Studies

# List all studies
studies = client.list_studies()

# Search by keyword
brca_studies = client.list_studies(keyword="breast")

# Filter by cancer type
brca_studies = client.list_studies(cancer_type_id="brca")

# Get a single study
study = client.get_study("brca_tcga")
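
Listing all studies is a case where pagination matters. The sketch below shows one common way to drain a paged endpoint: request pages by zero-based page number until a page comes back shorter than the page size. `fetch_all_pages` and `fake_fetch` are hypothetical names, and whether the client paginates exactly this way is an assumption.

```python
def fetch_all_pages(fetch_page, page_size=1000):
    """Collect results from fetch_page(page_number, page_size) until a short page."""
    results = []
    page_number = 0
    while True:
        page = fetch_page(page_number, page_size)
        results.extend(page)
        # A page shorter than page_size means there is nothing left to fetch.
        if len(page) < page_size:
            return results
        page_number += 1


# Demo: a fake paged endpoint backed by five invented study records.
FAKE_STUDIES = [{"studyId": f"study_{i}"} for i in range(5)]


def fake_fetch(page_number, page_size):
    start = page_number * page_size
    return FAKE_STUDIES[start:start + page_size]


all_studies = fetch_all_pages(fake_fetch, page_size=2)
print(len(all_studies))
```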

Samples

# Get all samples for a study
samples = client.get_samples("brca_tcga")

# Get just the sample IDs
sample_ids = client.get_sample_ids("brca_tcga")

Clinical Data

# Get all patient-level clinical attributes
records = client.get_clinical_data("brca_tcga", clinical_data_type="PATIENT")

# Get a specific subset of attributes
records = client.get_clinical_data(
    "brca_tcga",
    clinical_data_type="PATIENT",
    attribute_ids=["AJCC_PATHOLOGIC_TUMOR_STAGE", "OS_STATUS"],
)

# List available attributes
attrs = client.get_clinical_attributes("brca_tcga")
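
Clinical data from the cBioPortal REST API is typically returned in long format, one record per patient and attribute. Assuming the client passes those records through as dicts with `patientId`, `clinicalAttributeId`, and `value` keys (an assumption about the client, not something stated here), a small pivot gives one row per patient. The helper name and the sample values are illustrative.

```python
from collections import defaultdict


def pivot_clinical(records, id_key="patientId"):
    """Turn long-format clinical records into {patient: {attribute: value}}."""
    table = defaultdict(dict)
    for rec in records:
        table[rec[id_key]][rec["clinicalAttributeId"]] = rec["value"]
    return dict(table)


# Invented records shaped like the assumed API response.
demo_records = [
    {"patientId": "TCGA-A1-A0SB", "clinicalAttributeId": "OS_STATUS", "value": "0:LIVING"},
    {"patientId": "TCGA-A1-A0SB", "clinicalAttributeId": "AJCC_PATHOLOGIC_TUMOR_STAGE", "value": "STAGE I"},
    {"patientId": "TCGA-A1-A0SD", "clinicalAttributeId": "OS_STATUS", "value": "0:LIVING"},
]
wide = pivot_clinical(demo_records)
print(wide["TCGA-A1-A0SB"])
```

For sample-level data, pass `id_key="sampleId"` instead.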

Molecular Data

# List all molecular profiles for a study
profiles = client.get_molecular_profiles("brca_tcga")

# Get expression data using a sample list
expr = client.get_molecular_data(
    molecular_profile_id="brca_tcga_rna_seq_v2_mrna",
    sample_list_id="brca_tcga_rna_seq_v2_mrna",
)

# Get expression data for specific samples
expr = client.get_molecular_data(
    molecular_profile_id="brca_tcga_rna_seq_v2_mrna",
    sample_ids=["TCGA-A1-A0SB-01", "TCGA-A1-A0SD-01"],
    study_id="brca_tcga",
)
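
Expression records are usually easier to work with as a gene-by-sample matrix. The sketch below assumes each record is a dict with `entrezGeneId`, `sampleId`, and `value` keys, as in the REST API's molecular data responses; whether the client returns them unchanged is an assumption. The values shown are made up.

```python
def to_matrix(records):
    """Build {entrez_gene_id: {sample_id: value}} from long-format records."""
    matrix = {}
    for rec in records:
        matrix.setdefault(rec["entrezGeneId"], {})[rec["sampleId"]] = rec["value"]
    return matrix


# Invented values for two genes (672 = BRCA1, 7157 = TP53) and two samples.
demo_expr = [
    {"entrezGeneId": 672, "sampleId": "TCGA-A1-A0SB-01", "value": 10.5},
    {"entrezGeneId": 672, "sampleId": "TCGA-A1-A0SD-01", "value": 8.25},
    {"entrezGeneId": 7157, "sampleId": "TCGA-A1-A0SB-01", "value": 3.0},
]
matrix = to_matrix(demo_expr)
print(matrix[672])
```

Samples with no record for a gene are simply absent from that gene's inner dict, so downstream code should handle missing keys.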

Mutations

mutations = client.get_mutations(
    molecular_profile_id="brca_tcga_mutations",
    sample_list_id="brca_tcga_all",
)
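
A common first step with mutation records is counting how many samples carry a mutation in each gene. The sketch below assumes each record is a dict with `entrezGeneId` and `sampleId` keys, matching the REST API's mutation model; the helper name and the demo records are hypothetical.

```python
from collections import Counter


def mutated_sample_counts(mutations):
    """Count distinct samples with at least one mutation in each gene."""
    # De-duplicate first so several mutations in one sample count once.
    pairs = {(m["entrezGeneId"], m["sampleId"]) for m in mutations}
    return Counter(gene for gene, _ in pairs)


# Invented records: sample S1 has two mutations in gene 7157.
demo_mutations = [
    {"entrezGeneId": 7157, "sampleId": "S1"},
    {"entrezGeneId": 7157, "sampleId": "S1"},
    {"entrezGeneId": 7157, "sampleId": "S2"},
    {"entrezGeneId": 672, "sampleId": "S1"},
]
counts = mutated_sample_counts(demo_mutations)
print(counts.most_common())
```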

Copy-Number Segments

# Fetches per-sample segments (iterates internally; may be slow for large cohorts)
segments = client.get_copy_number_segments("brca_tcga")

# For a subset of samples
segments = client.get_copy_number_segments("brca_tcga", sample_ids=sample_ids)
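
Segment records can be summarised per sample, for example as a length-weighted mean of the segment values. The sketch below assumes each segment is a dict with `sampleId`, `start`, `end`, and `segmentMean` keys, as in the REST API's copy-number segment model; the helper and the demo values are illustrative only.

```python
from collections import defaultdict


def weighted_segment_mean(segments):
    """Return {sample_id: length-weighted mean of segmentMean}."""
    totals = defaultdict(lambda: [0.0, 0])  # [weighted sum, total length]
    for seg in segments:
        length = seg["end"] - seg["start"] + 1  # inclusive coordinates assumed
        totals[seg["sampleId"]][0] += seg["segmentMean"] * length
        totals[seg["sampleId"]][1] += length
    return {sample: total / length for sample, (total, length) in totals.items()}


# Invented segments: S1 has an equal-length gain and loss that cancel out.
demo_segments = [
    {"sampleId": "S1", "start": 1, "end": 100, "segmentMean": 0.5},
    {"sampleId": "S1", "start": 101, "end": 200, "segmentMean": -0.5},
    {"sampleId": "S2", "start": 1, "end": 50, "segmentMean": 1.0},
]
means = weighted_segment_mean(demo_segments)
print(means)
```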

Structural Variants

sv = client.get_structural_variants(
    molecular_profile_id="brca_tcga_sv",
    study_id="brca_tcga",
)

Generic Assay Data

Covers phosphoproteomics, methylation probes, arm-level CNA, genetic ancestry, and other GENERIC_ASSAY profiles.

data = client.get_generic_assay_data(
    molecular_profile_id="brca_tcga_rppa_Zscores",
    sample_list_id="brca_tcga_all",
)

YAML Configuration Format

Each cohort is defined in data/configs/cbioportal/<code>.yaml:

cohort:
  name: TCGA Breast Invasive Carcinoma
  code: BRCA
  study_id: brca_tcga
  description: "TCGA BRCA cohort via cBioPortal REST API"
  default_output_subdir: TCGA-BRCA

datasets:
  - name: clinical
    description: "Patient clinical attributes"
    category: clinical
    type: clinical
    clinical_data_type: PATIENT
    attribute_ids: []           # empty = all attributes
    filename: TCGA-BRCA.clinical.tsv

  - name: mutations
    description: "Somatic mutations"
    category: mutation
    type: mutations
    molecular_profile_id: brca_tcga_mutations
    sample_list_id: brca_tcga_all
    filename: TCGA-BRCA.mutations.tsv

Dataset Type Reference

`type`	Endpoint	Key Fields
`clinical`	`/studies/{id}/clinical-data`	`clinical_data_type`, `attribute_ids`
`mutations`	`/molecular-profiles/{id}/mutations/fetch`	`molecular_profile_id`, `sample_list_id`
`molecular`	`/molecular-profiles/{id}/molecular-data/fetch`	`molecular_profile_id`, `sample_list_id`
`copy_number_segments`	`/studies/{study_id}/samples/{sample_id}/copy-number-segments`	(none extra)
`structural_variants`	`/molecular-profiles/{id}/structural-variant/fetch`	`molecular_profile_id`
`generic_assay`	`/generic_assay_data/{id}/fetch`	`molecular_profile_id`, `sample_list_id`
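
The table above lends itself to a quick sanity check before a download is attempted. The sketch below validates the `datasets` list of an already-parsed config (for example the result of `yaml.safe_load`). Treating the key fields, plus `name`, `type`, and `filename`, as required is an assumption; the builder may apply defaults or different rules.

```python
# Key fields per dataset type, copied from the reference table.
KEY_FIELDS = {
    "clinical": ["clinical_data_type", "attribute_ids"],
    "mutations": ["molecular_profile_id", "sample_list_id"],
    "molecular": ["molecular_profile_id", "sample_list_id"],
    "copy_number_segments": [],
    "structural_variants": ["molecular_profile_id"],
    "generic_assay": ["molecular_profile_id", "sample_list_id"],
}
COMMON_FIELDS = ["name", "type", "filename"]  # assumed to be mandatory


def validate_datasets(datasets):
    """Return a list of human-readable problems; empty means the list looks fine."""
    problems = []
    for index, ds in enumerate(datasets):
        label = ds.get("name", f"#{index}")
        ds_type = ds.get("type")
        if ds_type not in KEY_FIELDS:
            problems.append(f"{label}: unknown type {ds_type!r}")
            continue
        for field in COMMON_FIELDS + KEY_FIELDS[ds_type]:
            if field not in ds:
                problems.append(f"{label}: missing {field}")
    return problems


# Demo: the mutations entry deliberately lacks sample_list_id.
demo_datasets = [
    {"name": "clinical", "type": "clinical", "clinical_data_type": "PATIENT",
     "attribute_ids": [], "filename": "TCGA-BRCA.clinical.tsv"},
    {"name": "mutations", "type": "mutations",
     "molecular_profile_id": "brca_tcga_mutations",
     "filename": "TCGA-BRCA.mutations.tsv"},
]
problems = validate_datasets(demo_datasets)
print(problems)
```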

Adding New Cohorts

Find the study_id using the CLI or client:

oncolearn cbioportal download --list --search breast --cancer-type brca

studies = client.list_studies(cancer_type_id="brca")
for s in studies:
    print(s["studyId"], s["name"])

Create data/configs/cbioportal/<code>.yaml following the structure above.
Run oncolearn cbioportal download --cohorts <CODE>. No Python code changes are needed.
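
As a starting point for a new config file, the skeleton can be generated from a few values. The sketch below renders a minimal config with a single clinical dataset, mirroring the fields shown in the YAML Configuration Format section. `render_cohort_config` is a hypothetical helper and the LUAD values are only an illustration. Values are inserted as plain text, so names containing YAML special characters would need quoting.

```python
TEMPLATE = """\
cohort:
  name: {name}
  code: {code}
  study_id: {study_id}
  default_output_subdir: {subdir}

datasets:
  - name: clinical
    category: clinical
    type: clinical
    clinical_data_type: PATIENT
    attribute_ids: []
    filename: {subdir}.clinical.tsv
"""


def render_cohort_config(code, study_id, name, subdir):
    """Fill the template; no YAML escaping is performed (illustrative only)."""
    return TEMPLATE.format(code=code, study_id=study_id, name=name, subdir=subdir)


config_text = render_cohort_config(
    code="LUAD",
    study_id="luad_tcga",
    name="TCGA Lung Adenocarcinoma",
    subdir="TCGA-LUAD",
)
print(config_text)
```

The rendered text would then be saved as data/configs/cbioportal/luad.yaml and extended with further dataset entries as needed.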

OncoLearn | A comprehensive toolkit for cancer genomics analysis and biomarker discovery.
