Secreted Protein Activity Inference using Ridge Regression
Python implementation of SecAct for inferring secreted protein activities from gene expression data.
Key Features:
- SecAct Compatible: Matches R SecAct/RidgeR results on the same platform (rng_method='srand')
- GPU Acceleration: Optional CuPy backend for large-scale analysis
- Million-Sample Scale: Batch processing with streaming output for massive datasets
- Streaming H5AD: Two-pass chunk reading for >5M-cell datasets without loading the full matrix (~3 GB peak vs ~200 GB)
- Built-in Signatures: Includes SecAct and CytoSig signature matrices
- Multi-Platform Support: Bulk RNA-seq, scRNA-seq, and Spatial Transcriptomics (Visium, CosMx)
- Smart Caching: Optional permutation table caching for faster repeated analyses
- Sparse-Aware: Automatic memory-efficient processing for sparse single-cell data
Recommended: Create a virtual environment before installing to avoid dependency conflicts with other packages.
python -m venv secactpy-env
source secactpy-env/bin/activate  # Linux/macOS
# secactpy-env\Scripts\activate   # Windows
# CPU Only
pip install secactpy
# With GPU Support (CUDA 11.x)
pip install "secactpy[gpu]"
# With GPU Support (CUDA 12.x)
pip install secactpy
pip install cupy-cuda12x

# CPU Only
pip install git+https://github.com/data2intelligence/SecActpy.git
# With GPU Support (CUDA 11.x)
pip install "secactpy[gpu] @ git+https://github.com/data2intelligence/SecActpy.git"
# With GPU Support (CUDA 12.x)
pip install git+https://github.com/data2intelligence/SecActpy.git
pip install cupy-cuda12x

git clone https://github.com/data2intelligence/SecActpy.git
cd SecActpy
pip install -e ".[dev]"

Example datasets for all Quick Start tutorials are available on Zenodo:
| Example | Input File | Output File | Size |
|---|---|---|---|
| Bulk RNA-seq | Ly86-Fc_vs_Vehicle_logFC.txt | Ly86-Fc_vs_Vehicle_logFC_output.h5ad | 0.5 MB |
| scRNA-seq (OV CD4 T cells) | OV_scRNAseq_CD4.h5ad | OV_scRNAseq_ct_CD4_output.h5ad, OV_scRNAseq_sc_CD4_output.h5ad | 34 MB |
| Visium ST (HCC) | Visium_HCC_data.h5ad | Visium_HCC_output.h5ad | 255 MB |
| CosMx (LIHC) | LIHC_CosMx_data.h5ad | LIHC_CosMx_output.h5ad | 3.0 GB |
Download all example files:
# Download individual files from Zenodo
wget https://zenodo.org/records/18520356/files/Ly86-Fc_vs_Vehicle_logFC.txt
wget https://zenodo.org/records/18520356/files/OV_scRNAseq_CD4.h5ad
wget https://zenodo.org/records/18520356/files/Visium_HCC_data.h5ad
wget https://zenodo.org/records/18520356/files/LIHC_CosMx_data.h5ad

import pandas as pd
from secactpy import secact_activity_inference
# Load differential expression data (genes × samples)
# Download: https://zenodo.org/records/18520356/files/Ly86-Fc_vs_Vehicle_logFC.txt
diff_expr = pd.read_csv("Ly86-Fc_vs_Vehicle_logFC.txt", sep=r"\s+", index_col=0)
# Run inference
result = secact_activity_inference(
    diff_expr,
    is_differential=True,
    sig_matrix="secact",  # or "cytosig"
    verbose=True
)
# Access results
activity = result['zscore'] # Activity z-scores
pvalues = result['pvalue'] # P-values
coefficients = result['beta']  # Regression coefficients

Note: Set is_differential=True when the input is already log fold-change data. For single-column input with no control, row-mean centering is automatically skipped (it would produce all zeros).
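
The reason centering is skipped for single-column input can be seen in a minimal NumPy sketch. It assumes that "row-mean centering" means subtracting each gene's mean across samples; the arrays are made-up values for illustration and do not come from the example dataset.

```python
import numpy as np

# Hypothetical log fold-change values: 4 genes x 1 sample (a single contrast).
single = np.array([[1.5], [-0.3], [0.0], [2.1]])

# Row-mean centering subtracts each gene's mean across samples.
centered_single = single - single.mean(axis=1, keepdims=True)

# With one column, every value equals its own row mean, so the signal vanishes.
all_zero = bool(np.all(centered_single == 0))

# With two or more columns, centering keeps the between-sample differences.
multi = np.array([[1.5, 0.5], [-0.3, 0.3], [0.0, 2.0], [2.1, 1.9]])
centered_multi = multi - multi.mean(axis=1, keepdims=True)
```

Here `all_zero` is `True`, while `centered_multi` still carries per-gene differences between the two samples.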
import anndata as ad
from secactpy import secact_activity_inference_scrnaseq
# Load scRNA-seq data (788 OV CD4 T cells, 3 subtypes)
# Download: https://zenodo.org/records/18520356/files/OV_scRNAseq_CD4.h5ad
adata = ad.read_h5ad("OV_scRNAseq_CD4.h5ad")
# Pseudo-bulk by cell type
result = secact_activity_inference_scrnaseq(
    adata,
    cell_type_col="Annotation",
    is_single_cell_level=False,
    verbose=True
)
# Single-cell level
result_sc = secact_activity_inference_scrnaseq(
    adata,
    cell_type_col="Annotation",
    is_single_cell_level=True,
    verbose=True
)

from secactpy import secact_activity_inference_st
# Load Visium HCC data (3,415 spots)
# Download: https://zenodo.org/records/18520356/files/Visium_HCC_data.h5ad
result = secact_activity_inference_st(
    "Visium_HCC_data.h5ad",
    min_genes=1000,
    verbose=True
)
activity = result['zscore']  # (proteins × spots)

import anndata as ad
from secactpy import secact_activity_inference_st
# Load CosMx LIHC data (443,515 cells, 1,000 genes, 12 cell types)
# Download: https://zenodo.org/records/18520356/files/LIHC_CosMx_data.h5ad
adata = ad.read_h5ad("LIHC_CosMx_data.h5ad")
# Single-cell resolution (one score per cell)
result = secact_activity_inference_st(
    adata,
    is_spot_level=True,  # Score each cell individually (default)
    batch_size=5000,  # Process in chunks to limit memory
    output_path="cosmx_sc_results.h5ad",  # Stream to disk
    verbose=True
)
# result is None when output_path is set; load with ad.read_h5ad()
# Cell-type resolution (pseudo-bulk by cell type)
result = secact_activity_inference_st(
    adata,
    cell_type_col="cellType",  # Column in adata.obs
    is_spot_level=False,  # Aggregate by cell type
    verbose=True
)
activity = result['zscore']  # (proteins × cell_types)

For large datasets (50,000+ samples), batch processing splits computation into
memory-efficient chunks while producing mathematically identical results.
The projection matrix is computed once, then samples are processed in chunks.
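
As a rough illustration of why chunking does not change the result, the sketch below builds a ridge projection matrix once and applies it to the samples both in one shot and in batches. The matrix sizes, the penalty `lam`, and the random data are assumptions chosen for the example; this is plain NumPy, not the SecActPy internals.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_proteins, n_samples = 200, 10, 1000
X = rng.standard_normal((n_genes, n_proteins))  # signature matrix (genes x proteins)
Y = rng.standard_normal((n_genes, n_samples))   # expression (genes x samples)
lam = 10.0  # hypothetical ridge penalty

# Projection matrix T = (X^T X + lam * I)^-1 X^T, computed once.
T = np.linalg.solve(X.T @ X + lam * np.eye(n_proteins), X.T)

# Full computation: all samples at once.
beta_full = T @ Y

# Batched computation: reuse T and process the columns of Y in chunks.
batch_size = 256
beta_batched = np.empty((n_proteins, n_samples))
for start in range(0, n_samples, batch_size):
    stop = min(start + batch_size, n_samples)
    beta_batched[:, start:stop] = T @ Y[:, start:stop]

max_abs_diff = float(np.max(np.abs(beta_full - beta_batched)))
```

Each sample's coefficients depend only on `T` and that sample's column of `Y`, so the batches are independent and `max_abs_diff` stays at floating-point rounding level.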
Set batch_size on any high-level function:
result = secact_activity_inference(expr_df, ..., batch_size=5000)
result = secact_activity_inference_scrnaseq(adata, ..., batch_size=5000)
result = secact_activity_inference_st(adata, ..., batch_size=5000)

| Mode | Parameter | Return value | Memory for output |
|---|---|---|---|
| In-memory (default) | output_path=None | dict of DataFrames | All results in RAM |
| Streaming | output_path="results.h5ad" | None | Only one batch at a time |
Setting sparse_mode=True keeps sparse Y matrices in sparse format end-to-end,
avoiding densification. For highly sparse single-cell data (<5% density), this
reduces memory by orders of magnitude and runs ~1.8x faster, with identical results.
See Batch Processing for worked examples and streaming output details.
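
One way column centering can be applied without densifying Y is to fold it into the projection algebraically: T(Y - 1μᵀ) = TY - (T1)μᵀ. The sketch below checks that identity with SciPy; it is an assumption about how a sparse-aware path could work, not a description of what sparse_mode=True does internally, and `T` is a random stand-in for the projection matrix.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(1)
n_genes, n_proteins, n_cells = 300, 8, 500
T = rng.standard_normal((n_proteins, n_genes))  # stand-in projection matrix
Y = sparse.random(n_genes, n_cells, density=0.03, format="csc", random_state=1)

# Column means taken directly from the sparse matrix.
col_mean = np.asarray(Y.mean(axis=0)).ravel()  # shape (n_cells,)

# Sparse path: T @ (Y - 1 mu^T) = T @ Y - (T @ 1) mu^T, so Y is never densified.
TY = (Y.T @ T.T).T          # sparse-dense product, returns a dense (proteins x cells) array
row_sums_T = T.sum(axis=1)  # T @ 1
beta_sparse = TY - np.outer(row_sums_T, col_mean)

# Dense reference for comparison only.
beta_dense = T @ (Y.toarray() - col_mean)
```

Only the small (proteins × cells) output is dense in the sparse path; the (genes × cells) input keeps its sparse storage throughout.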
For very large single-cell datasets (>5M cells) that exceed available RAM even
with batch processing, streaming=True bypasses full-matrix loading entirely.
The H5AD file is read in chunks via h5py using a two-pass algorithm:
- Pass 1: Read cell chunks, normalize (CPM + log2), accumulate row/column statistics
- Pass 2: Re-read chunks, compute per-chunk cross terms, run inference in sub-batches
Peak memory drops from ~200 GB to ~3 GB for a 5M-cell dataset. Results are numerically identical to the non-streaming path.
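
The two passes above can be sketched on a small in-memory array. Everything here is illustrative: `read_chunk` is a hypothetical stand-in for an h5py-backed reader, the normalization is assumed to be log2(CPM + 1), the accumulated statistic is assumed to be the per-gene mean, and `T` is a random stand-in for the projection matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
n_cells, n_genes, n_proteins = 1000, 50, 5
counts = rng.poisson(2.0, size=(n_cells, n_genes)).astype(float)  # stand-in for X on disk
counts[:, 0] += 1.0  # guarantee every cell has a nonzero library size
T = rng.standard_normal((n_proteins, n_genes))  # stand-in projection matrix

def read_chunk(start, stop):
    """Hypothetical chunk reader; a real one would slice the H5AD file via h5py."""
    return counts[start:stop]

def normalize(chunk):
    """Per-cell CPM followed by log2(x + 1); each cell is normalized independently."""
    cpm = chunk / chunk.sum(axis=1, keepdims=True) * 1e6
    return np.log2(cpm + 1.0)

chunk_size = 128
starts = range(0, n_cells, chunk_size)

# Pass 1: accumulate per-gene sums without holding the whole matrix.
gene_sum = np.zeros(n_genes)
for s in starts:
    gene_sum += normalize(read_chunk(s, s + chunk_size)).sum(axis=0)
gene_mean = gene_sum / n_cells

# Pass 2: re-read each chunk, center with the global statistics, and project.
activity = np.empty((n_proteins, n_cells))
for s in starts:
    chunk = normalize(read_chunk(s, s + chunk_size)) - gene_mean
    activity[:, s:s + chunk.shape[0]] = T @ chunk.T

# Reference: the same computation with the full matrix in memory.
full = normalize(counts)
reference = T @ (full - full.mean(axis=0)).T
```

The first pass is needed because the centering statistic depends on all cells, while the per-cell normalization and the projection can both be done one chunk at a time.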
# scRNA-seq: 6.5M cells, ~3 GB peak memory
result = secact_activity_inference_scrnaseq(
    "large_atlas.h5ad",  # file path (not AnnData object)
    cell_type_col="cell_type",
    is_single_cell_level=True,
    streaming=True,  # enable two-pass chunk reading
    streaming_chunk_size=50_000,  # cells per chunk (default)
    output_path="results.h5ad",  # stream results to disk
    verbose=True,
)
# Spatial transcriptomics: same interface
result = secact_activity_inference_st(
    "large_spatial.h5ad",
    streaming=True,
    output_path="st_results.h5ad",
    verbose=True,
)

Requirements: streaming=True requires adata to be a file path (not an in-memory AnnData), is_single_cell_level=True (scRNA-seq) or is_spot_level=True (ST), and the H5AD must store X in sparse (CSR/CSC) format.
See Batch Processing for full details.
See API Reference for full function signatures, parameters, and options. For low-level ridge() / ridge_batch() usage, see Advanced API.
from secactpy import secact_activity_inference, CUPY_AVAILABLE
print(f"GPU available: {CUPY_AVAILABLE}")
result = secact_activity_inference(expression, backend='auto')

| Dataset | Py (CPU) | Py (GPU) | Speedup |
|---|---|---|---|
| Bulk (1,170 sp × 1,000 samples) | 128.8s | 6.7s | 11–19x |
| scRNA-seq (1,170 sp × 788 cells) | 104.8s | 6.8s | 8–15x |
| Visium (1,170 sp × 3,404 spots) | 381.4s | 11.2s | 13–34x |
| CosMx (151 sp × 443,515 cells) | 1226.7s | 99.9s | 9–12x |
See GPU Acceleration for full benchmarks and CUDA setup. See DOCKER.md for Docker vs native performance benchmarks.
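
For readers curious what an automatic backend choice can look like, here is a minimal sketch of a NumPy/CuPy fallback. The helper `pick_backend` is hypothetical and is not part of SecActPy; how backend='auto' is actually resolved inside the package is not shown in this README.

```python
import importlib.util
import numpy as np

def pick_backend(preference="auto"):
    """Return (array_module, name); use CuPy only if it imports and sees a device."""
    if preference in ("auto", "cupy"):
        if importlib.util.find_spec("cupy") is not None:
            try:
                import cupy as cp
                if cp.cuda.runtime.getDeviceCount() > 0:
                    return cp, "cupy"
            except Exception:
                pass  # CuPy present but unusable (e.g. no driver): fall through
        if preference == "cupy":
            raise RuntimeError("CuPy backend requested but no usable GPU was found")
    return np, "numpy"

xp, backend_name = pick_backend("auto")

# The same array code runs on either backend because CuPy mirrors the NumPy API.
a = xp.arange(6, dtype=xp.float64).reshape(2, 3)
gram = a @ a.T

# Bring the result back to host memory if it was computed on the GPU.
gram_host = gram.get() if backend_name == "cupy" else gram
```

On a machine without CuPy or without a visible CUDA device, `backend_name` is `"numpy"` and the computation runs on the CPU unchanged.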
secactpy bulk -i diff_expr.tsv -o results.h5ad --differential -v
secactpy scrnaseq -i data.h5ad -o results.h5ad --cell-type-col celltype -v
secactpy visium -i /path/to/visium/ -o results.h5ad -v
secactpy cosmx -i cosmx.h5ad -o results.h5ad --batch-size 50000 -v

| Option | Description |
|---|---|
| -i, --input | Input file or directory |
| -o, --output | Output H5AD file |
| -s, --signature | Signature matrix (secact, cytosig) |
| --backend | Computation backend (auto, numpy, cupy) |
| --batch-size | Batch size for large datasets |
| -v, --verbose | Verbose output |
See CLI Reference for all commands and options.
docker pull psychemistz/secactpy:latest # CPU
docker pull psychemistz/secactpy:gpu # GPU
docker pull psychemistz/secactpy:with-r  # With R SecAct/RidgeR

See DOCKER.md for detailed usage instructions.
SecActPy supports three RNG backends for different reproducibility needs:
| rng_method | Description | Use case |
|---|---|---|
| 'srand' | C stdlib srand()/rand() via ctypes | Match R SecAct/RidgeR results on the same platform |
| 'gsl' | Mersenne Twister (GSL-compatible) | Cross-platform reproducibility within SecActPy |
| 'numpy' | Native NumPy RNG (~70x faster) | Fast analysis when reproducibility with R is not needed |
# Match R SecAct on same platform (default)
result = secact_activity_inference(expr, rng_method="srand")
# Cross-platform reproducible
result = secact_activity_inference(expr, rng_method="gsl")
# Fastest (~70x faster permutations)
result = secact_activity_inference(expr, rng_method="numpy")

See Reproducibility for detailed examples.
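
The RNG matters because permutation-based statistics depend on which shuffles are drawn. The sketch below is a generic permutation test around ridge coefficients, with NumPy's `default_rng` standing in for the RNG; it is an assumption-laden illustration, not SecActPy's exact procedure, and it does not reproduce the srand or gsl streams. The function name, `lam`, `n_perm`, and the toy data are all made up for the example.

```python
import numpy as np

def permutation_zscores(X, y, lam=10.0, n_perm=200, seed=0):
    """Ridge coefficients for one sample plus permutation-based z-scores and p-values."""
    rng = np.random.default_rng(seed)
    T = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
    beta = T @ y
    # Null distribution: shuffle the gene order of y and refit.
    null = np.empty((n_perm, X.shape[1]))
    for i in range(n_perm):
        null[i] = T @ y[rng.permutation(y.shape[0])]
    z = (beta - null.mean(axis=0)) / null.std(axis=0)
    p = (np.sum(np.abs(null) >= np.abs(beta), axis=0) + 1) / (n_perm + 1)
    return beta, z, p

# Toy data: only the first of four signatures truly drives y.
data_rng = np.random.default_rng(42)
X = data_rng.standard_normal((100, 4))
y = X @ np.array([2.0, 0.0, 0.0, 0.0]) + 0.1 * data_rng.standard_normal(100)

_, z_a, p_a = permutation_zscores(X, y, seed=0)
_, z_b, _ = permutation_zscores(X, y, seed=0)  # same seed: same permutations
_, z_c, _ = permutation_zscores(X, y, seed=1)  # different seed: different permutations
```

Runs with the same seed give identical z-scores, while a different seed gives slightly different ones. Matching another implementation's output therefore requires reproducing its exact random stream, not just its algorithm.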
- Python ≥ 3.9
- NumPy ≥ 1.20
- Pandas ≥ 1.3
- SciPy ≥ 1.7
- h5py ≥ 3.0
- anndata ≥ 0.8
- scanpy ≥ 1.9
Optional: CuPy ≥ 10.0 (GPU acceleration)
If you use SecActPy in your research, please cite:
Beibei Ru, Lanqi Gong, Emily Yang, Seongyong Park, George Zaki, Kenneth Aldape, Lalage Wakefield, Peng Jiang. Inference of secreted protein activities in intercellular communication. Nature Methods, 2026 (In press)
- SecAct - Original R implementation
- RidgeR - R ridge regression package
- SpaCET - Spatial transcriptomics cell type analysis
- CytoSig - Cytokine signaling inference
MIT License - see LICENSE for details.
See CHANGELOG.md for full version history.
- Streaming H5AD: streaming=True for two-pass chunk reading of >5M-cell datasets (~3 GB peak)
- H5ADChunkReader for memory-efficient H5AD reading via h5py
- Fixed H5AD index column detection for obs.attrs['_index'] convention
- col_center and col_scale parameters for independent control of sparse in-flight normalization
- rng_method parameter for explicit RNG selection
- is_group_sig=True by default