SecActPy

Secreted Protein Activity Inference using Ridge Regression

Python implementation of SecAct for inferring secreted protein activities from gene expression data.

Key Features:

SecAct Compatible: Matches R SecAct/RidgeR results on the same platform (rng_method='srand')
GPU Acceleration: Optional CuPy backend for large-scale analysis
Million-Sample Scale: Batch processing with streaming output for massive datasets
Streaming H5AD: Two-pass chunk reading for >5M-cell datasets without loading the full matrix (~3 GB peak vs ~200 GB)
Built-in Signatures: Includes SecAct and CytoSig signature matrices
Multi-Platform Support: Bulk RNA-seq, scRNA-seq, and Spatial Transcriptomics (Visium, CosMx)
Smart Caching: Optional permutation table caching for faster repeated analyses
Sparse-Aware: Automatic memory-efficient processing for sparse single-cell data

Installation

Recommended: Create a virtual environment before installing to avoid dependency conflicts with other packages.
python -m venv secactpy-env
source secactpy-env/bin/activate   # Linux/macOS
# secactpy-env\Scripts\activate    # Windows

From PyPI (Recommended)

# CPU Only
pip install secactpy

# With GPU Support (CUDA 11.x)
pip install "secactpy[gpu]"

# With GPU Support (CUDA 12.x)
pip install secactpy
pip install cupy-cuda12x

From GitHub

# CPU Only
pip install git+https://github.com/data2intelligence/SecActpy.git

# With GPU Support (CUDA 11.x)
pip install "secactpy[gpu] @ git+https://github.com/data2intelligence/SecActpy.git"

# With GPU Support (CUDA 12.x)
pip install git+https://github.com/data2intelligence/SecActpy.git
pip install cupy-cuda12x

Development Installation

git clone https://github.com/data2intelligence/SecActpy.git
cd SecActpy
pip install -e ".[dev]"

Quick Start

Example Data

Example datasets for all Quick Start tutorials are available on Zenodo:

Example	Input File	Output File	Size
Bulk RNA-seq	`Ly86-Fc_vs_Vehicle_logFC.txt`	`Ly86-Fc_vs_Vehicle_logFC_output.h5ad`	0.5 MB
scRNA-seq (OV CD4 T cells)	`OV_scRNAseq_CD4.h5ad`	`OV_scRNAseq_ct_CD4_output.h5ad`, `OV_scRNAseq_sc_CD4_output.h5ad`	34 MB
Visium ST (HCC)	`Visium_HCC_data.h5ad`	`Visium_HCC_output.h5ad`	255 MB
CosMx (LIHC)	`LIHC_CosMx_data.h5ad`	`LIHC_CosMx_output.h5ad`	3.0 GB

Download the example files:

# Download individual files from Zenodo
wget https://zenodo.org/records/18520356/files/Ly86-Fc_vs_Vehicle_logFC.txt
wget https://zenodo.org/records/18520356/files/OV_scRNAseq_CD4.h5ad
wget https://zenodo.org/records/18520356/files/Visium_HCC_data.h5ad
wget https://zenodo.org/records/18520356/files/LIHC_CosMx_data.h5ad

Example 1: Bulk RNA-seq

import pandas as pd
from secactpy import secact_activity_inference

# Load differential expression data (genes × samples)
# Download: https://zenodo.org/records/18520356/files/Ly86-Fc_vs_Vehicle_logFC.txt
diff_expr = pd.read_csv("Ly86-Fc_vs_Vehicle_logFC.txt", sep=r"\s+", index_col=0)

# Run inference
result = secact_activity_inference(
    diff_expr,
    is_differential=True,
    sig_matrix="secact",  # or "cytosig"
    verbose=True
)

# Access results
activity = result['zscore']    # Activity z-scores
pvalues = result['pvalue']     # P-values
coefficients = result['beta']  # Regression coefficients

Note: Set is_differential=True when the input is already log fold-change data. For single-column input with no control, row-mean centering is automatically skipped (it would produce all zeros).
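To see why centering is skipped for a single column, here is a minimal NumPy sketch with made-up values (it is not SecActPy's internal code): subtracting each gene's mean across samples leaves nothing when there is only one sample.

```python
import numpy as np

# Toy log fold-change matrix: 3 genes x 1 sample (values are made up).
single = np.array([[1.5], [-0.7], [2.1]])

# Row-mean centering subtracts each gene's mean across samples.
# With one sample, the mean equals the value itself, so everything becomes 0.
centered_single = single - single.mean(axis=1, keepdims=True)

# With two samples, centering keeps the between-sample contrast.
paired = np.array([[1.5, 0.5], [-0.7, 0.3], [2.1, 2.1]])
centered_paired = paired - paired.mean(axis=1, keepdims=True)

print(centered_single.ravel())
print(centered_paired)
```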

Example 2: scRNA-seq Analysis

import anndata as ad
from secactpy import secact_activity_inference_scrnaseq

# Load scRNA-seq data (788 OV CD4 T cells, 3 subtypes)
# Download: https://zenodo.org/records/18520356/files/OV_scRNAseq_CD4.h5ad
adata = ad.read_h5ad("OV_scRNAseq_CD4.h5ad")

# Pseudo-bulk by cell type
result = secact_activity_inference_scrnaseq(
    adata,
    cell_type_col="Annotation",
    is_single_cell_level=False,
    verbose=True
)

# Single-cell level
result_sc = secact_activity_inference_scrnaseq(
    adata,
    cell_type_col="Annotation",
    is_single_cell_level=True,
    verbose=True
)
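Pseudo-bulk mode collapses cells that share an annotation into one profile per cell type before inference. The NumPy sketch below illustrates that kind of aggregation. The helper name `pseudobulk_sum`, the toy labels, and the choice to sum raw counts are assumptions made for illustration, not a description of SecActPy's implementation.

```python
import numpy as np

def pseudobulk_sum(counts, labels):
    """Sum a genes x cells matrix over cells sharing a label (hypothetical helper).

    Returns the sorted unique labels and a genes x n_labels matrix.
    """
    labels = np.asarray(labels)
    groups, inverse = np.unique(labels, return_inverse=True)
    # One-hot membership matrix (cells x groups); a matrix product then
    # adds up the columns belonging to each group.
    membership = np.zeros((labels.size, groups.size))
    membership[np.arange(labels.size), inverse] = 1.0
    return groups, counts @ membership

# Toy data: 2 genes x 4 cells with made-up annotations.
counts = np.array([[1, 0, 2, 3],
                   [0, 4, 1, 0]], dtype=float)
labels = ["Treg", "Th1", "Treg", "Th1"]

groups, bulk = pseudobulk_sum(counts, labels)
print(groups)
print(bulk)
```

Single-cell mode skips this step and scores every column of the original matrix, which is why it is the mode that benefits most from batching.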

Example 3: Spatial Transcriptomics

Visium (spot-level)

from secactpy import secact_activity_inference_st

# Load Visium HCC data (3,415 spots)
# Download: https://zenodo.org/records/18520356/files/Visium_HCC_data.h5ad
result = secact_activity_inference_st(
    "Visium_HCC_data.h5ad",
    min_genes=1000,
    verbose=True
)

activity = result['zscore']  # (proteins × spots)

CosMx (single-cell spatial)

import anndata as ad
from secactpy import secact_activity_inference_st

# Load CosMx LIHC data (443,515 cells, 1,000 genes, 12 cell types)
# Download: https://zenodo.org/records/18520356/files/LIHC_CosMx_data.h5ad
adata = ad.read_h5ad("LIHC_CosMx_data.h5ad")

# Single-cell resolution (one score per cell)
result = secact_activity_inference_st(
    adata,
    is_spot_level=True,         # Score each cell individually (default)
    batch_size=5000,            # Process in chunks to limit memory
    output_path="cosmx_sc_results.h5ad",  # Stream to disk
    verbose=True
)
# result is None when output_path is set; load with ad.read_h5ad()

# Cell-type resolution (pseudo-bulk by cell type)
result = secact_activity_inference_st(
    adata,
    cell_type_col="cellType",  # Column in adata.obs
    is_spot_level=False,        # Aggregate by cell type
    verbose=True
)

activity = result['zscore']  # (proteins × cell_types)

Batch Processing

For large datasets (50,000+ samples), batch processing splits the computation into memory-efficient chunks and produces mathematically identical results: the projection matrix is computed once and then applied to the samples one chunk at a time. Set batch_size on any high-level function:

result = secact_activity_inference(expr_df, ..., batch_size=5000)
result = secact_activity_inference_scrnaseq(adata, ..., batch_size=5000)
result = secact_activity_inference_st(adata, ..., batch_size=5000)
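The reason batching does not change the result can be sketched with plain NumPy. The example assumes the textbook ridge projection T = (XᵀX + λI)⁻¹Xᵀ; the matrix sizes, the random data, and the value of `lam` are arbitrary choices for illustration, not SecActPy defaults.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_proteins, n_samples = 200, 10, 1000

X = rng.standard_normal((n_genes, n_proteins))  # signature matrix (genes x proteins)
Y = rng.standard_normal((n_genes, n_samples))   # expression matrix (genes x samples)
lam = 1.0                                       # arbitrary ridge penalty for this demo

# Projection matrix, computed once: T = (X^T X + lam * I)^-1 X^T
T = np.linalg.solve(X.T @ X + lam * np.eye(n_proteins), X.T)

# All samples at once.
beta_full = T @ Y

# Same projection applied chunk by chunk; each sample column is independent,
# so stacking the chunk results reproduces the full result.
batch_size = 256
beta_batched = np.hstack(
    [T @ Y[:, start:start + batch_size] for start in range(0, n_samples, batch_size)]
)

print(np.allclose(beta_full, beta_batched))
```

Only one `n_genes x batch_size` slice of `Y` has to be dense in memory at a time, which is where the memory saving comes from.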

Mode	Parameter	Return value	Memory for output
In-memory (default)	`output_path=None`	`dict` of DataFrames	All results in RAM
Streaming	`output_path="results.h5ad"`	`None`	Only one batch at a time

Setting sparse_mode=True keeps a sparse Y matrix in sparse format end to end. This avoids densification, which can reduce memory use by orders of magnitude on highly sparse single-cell data. At <5% density it is also about 1.8x faster, and the results are identical.
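One way centering and scaling can be applied without ever densifying Y is to multiply the raw matrix first and correct the product afterwards. The sketch below checks that identity with NumPy. It assumes per-column (per-cell) centering and scaling and uses a dense array as a stand-in for a sparse matrix; whether SecActPy uses exactly this formulation is not stated here.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_proteins, n_cells = 300, 8, 50

# Mostly-zero matrix standing in for sparse single-cell data (~5% non-zero).
Y = rng.random((n_genes, n_cells)) * (rng.random((n_genes, n_cells)) < 0.05)
T = rng.standard_normal((n_proteins, n_genes))  # stand-in projection matrix

mu = Y.mean(axis=0)     # per-cell mean
sigma = Y.std(axis=0)   # per-cell standard deviation

# Dense route: subtracting the mean turns every zero into a non-zero value.
dense = T @ ((Y - mu) / sigma)

# Sparse-friendly route: multiply the raw (sparse) Y, then correct the product.
# T @ (Y - 1 mu^T) = T @ Y - rowsum(T) mu^T
sparse_friendly = (T @ Y - np.outer(T.sum(axis=1), mu)) / sigma

print(np.allclose(dense, sparse_friendly))
```

With a real `scipy.sparse` matrix, only `T @ Y` touches the data, so the zeros never need to be stored.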

See Batch Processing for worked examples and streaming output details.

Streaming H5AD (>5M Cells)

For very large single-cell datasets (>5M cells) that exceed available RAM even with batch processing, streaming=True bypasses full-matrix loading entirely. The H5AD file is read in chunks via h5py using a two-pass algorithm:

Pass 1: Read cell chunks, normalize (CPM + log2), accumulate row/column statistics
Pass 2: Re-read chunks, compute per-chunk cross terms, run inference in sub-batches

Peak memory drops from ~200 GB to ~3 GB for a 5M-cell dataset. Results are numerically identical to the non-streaming path.
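The two-pass idea can be illustrated on an in-memory array that is only ever read in chunks. This is a simplified sketch under stated assumptions: `iter_chunks` stands in for chunked reads from disk, normalization is per-cell CPM followed by log2(x + 1), and the only global statistic accumulated is the per-gene mean. The statistics and cross terms SecActPy actually tracks may differ.

```python
import numpy as np

def iter_chunks(matrix, chunk_size):
    """Yield consecutive row blocks; stands in for chunked reads from disk."""
    for start in range(0, matrix.shape[0], chunk_size):
        yield matrix[start:start + chunk_size]

def normalize(chunk):
    """Per-cell CPM followed by log2(x + 1)."""
    totals = chunk.sum(axis=1, keepdims=True)
    return np.log2(chunk / totals * 1e6 + 1.0)

rng = np.random.default_rng(2)
counts = rng.poisson(2.0, size=(1000, 50)).astype(float)  # cells x genes
T = rng.standard_normal((5, 50))                          # stand-in projection (proteins x genes)
chunk_size = 128

# Pass 1: accumulate the per-gene sum of normalized values across all chunks.
gene_sum = np.zeros(counts.shape[1])
n_cells = 0
for chunk in iter_chunks(counts, chunk_size):
    norm = normalize(chunk)
    gene_sum += norm.sum(axis=0)
    n_cells += norm.shape[0]
gene_mean = gene_sum / n_cells

# Pass 2: re-read each chunk, center with the global mean, and project.
scores = np.vstack(
    [(normalize(chunk) - gene_mean) @ T.T for chunk in iter_chunks(counts, chunk_size)]
)

# Reference: the same computation with the full matrix in memory.
full = normalize(counts)
reference = (full - full.mean(axis=0)) @ T.T
print(np.allclose(scores, reference))
```

The second pass is needed because centering a chunk requires a statistic that is only known after every chunk has been seen once.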

# scRNA-seq: 6.5M cells, ~3 GB peak memory
result = secact_activity_inference_scrnaseq(
    "large_atlas.h5ad",               # file path (not AnnData object)
    cell_type_col="cell_type",
    is_single_cell_level=True,
    streaming=True,                    # enable two-pass chunk reading
    streaming_chunk_size=50_000,       # cells per chunk (default)
    output_path="results.h5ad",        # stream results to disk
    verbose=True,
)

# Spatial transcriptomics: same interface
result = secact_activity_inference_st(
    "large_spatial.h5ad",
    streaming=True,
    output_path="st_results.h5ad",
    verbose=True,
)

Requirements: streaming=True requires that (1) adata is a file path rather than an in-memory AnnData object, (2) is_single_cell_level=True (scRNA-seq) or is_spot_level=True (ST) is set, and (3) the H5AD file stores X in sparse (CSR/CSC) format.
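Before launching a long streaming run, it may help to confirm that X is stored sparsely. The sketch below relies on the `encoding-type` attribute that recent anndata versions write on the X group (`csr_matrix` or `csc_matrix` for sparse data). Both helper names are hypothetical and not part of SecActPy, and files written by older anndata versions may label X differently.

```python
def is_sparse_encoding(encoding_type):
    """Return True for the encoding-type values anndata uses for sparse matrices."""
    if isinstance(encoding_type, bytes):
        encoding_type = encoding_type.decode()
    return encoding_type in ("csr_matrix", "csc_matrix")

def h5ad_x_is_sparse(path):
    """Check whether X in an H5AD file is stored as CSR/CSC (hypothetical helper)."""
    import h5py  # imported lazily so is_sparse_encoding has no dependencies

    with h5py.File(path, "r") as f:
        # Sparse X is an HDF5 group carrying an encoding-type attribute;
        # a missing attribute is treated as "not sparse".
        return is_sparse_encoding(f["X"].attrs.get("encoding-type", ""))

# Example usage (the file name is a placeholder):
# if not h5ad_x_is_sparse("large_atlas.h5ad"):
#     raise SystemExit("X is dense; convert it to CSR before using streaming=True")
```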

See Batch Processing for full details.

API Reference

See API Reference for full function signatures, parameters, and options. For low-level ridge() / ridge_batch() usage, see Advanced API.

GPU Acceleration

from secactpy import secact_activity_inference, CUPY_AVAILABLE

print(f"GPU available: {CUPY_AVAILABLE}")
result = secact_activity_inference(expression, backend='auto')

Dataset	Py (CPU)	Py (GPU)	Speedup
Bulk (1,170 sp × 1,000 samples)	128.8s	6.7s	11–19x
scRNA-seq (1,170 sp × 788 cells)	104.8s	6.8s	8–15x
Visium (1,170 sp × 3,404 spots)	381.4s	11.2s	13–34x
CosMx (151 sp × 443,515 cells)	1226.7s	99.9s	9–12x

See GPU Acceleration for full benchmarks and CUDA setup. See DOCKER.md for Docker vs native performance benchmarks.
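How backend='auto' resolves is not spelled out above. A plausible sketch, assuming it prefers CuPy when the package is importable and otherwise falls back to NumPy, is shown below. `resolve_backend` is a hypothetical helper, not a SecActPy function.

```python
import importlib.util

def resolve_backend(requested="auto"):
    """Map 'auto', 'numpy', or 'cupy' to a concrete backend name (hypothetical)."""
    # find_spec only checks that the package is installed; it does not
    # verify that a working CUDA device is present.
    has_cupy = importlib.util.find_spec("cupy") is not None
    if requested == "auto":
        return "cupy" if has_cupy else "numpy"
    if requested not in ("numpy", "cupy"):
        raise ValueError(f"unknown backend: {requested!r}")
    if requested == "cupy" and not has_cupy:
        raise RuntimeError("backend='cupy' requested but CuPy is not installed")
    return requested

print(resolve_backend("auto"))
```

Requesting `'cupy'` explicitly fails loudly in this sketch, which is often preferable on a cluster where a silent CPU fallback would waste a GPU allocation.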

Command Line Interface

secactpy bulk -i diff_expr.tsv -o results.h5ad --differential -v
secactpy scrnaseq -i data.h5ad -o results.h5ad --cell-type-col celltype -v
secactpy visium -i /path/to/visium/ -o results.h5ad -v
secactpy cosmx -i cosmx.h5ad -o results.h5ad --batch-size 50000 -v

Option	Description
`-i, --input`	Input file or directory
`-o, --output`	Output H5AD file
`-s, --signature`	Signature matrix (secact, cytosig)
`--backend`	Computation backend (auto, numpy, cupy)
`--batch-size`	Batch size for large datasets
`-v, --verbose`	Verbose output

See CLI Reference for all commands and options.

Docker

docker pull psychemistz/secactpy:latest      # CPU
docker pull psychemistz/secactpy:gpu          # GPU
docker pull psychemistz/secactpy:with-r       # With R SecAct/RidgeR

See DOCKER.md for detailed usage instructions.

Reproducibility

SecActPy supports three RNG backends for different reproducibility needs:

`rng_method`	Description	Use case
`'srand'`	C stdlib `srand()`/`rand()` via ctypes	Match R SecAct/RidgeR results on the same platform
`'gsl'`	Mersenne Twister (GSL-compatible)	Cross-platform reproducibility within SecActPy
`'numpy'`	Native NumPy RNG (~70x faster)	Fast analysis when reproducibility with R is not needed

# Match R SecAct on same platform (default)
result = secact_activity_inference(expr, rng_method="srand")

# Cross-platform reproducible
result = secact_activity_inference(expr, rng_method="gsl")

# Fastest (~70x faster permutations)
result = secact_activity_inference(expr, rng_method="numpy")

See Reproducibility for detailed examples.
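The RNG choice matters because significance is derived from permutations, so the random stream feeds directly into the z-scores. The sketch below is a generic permutation test around a ridge fit, written with NumPy's own generator. The function `permutation_zscores`, its parameters, and the z-score formula are illustrative assumptions and are not SecActPy's implementation.

```python
import numpy as np

def permutation_zscores(X, y, lam=1.0, n_perm=200, seed=0):
    """Ridge coefficients standardized against a permutation null (illustrative)."""
    rng = np.random.default_rng(seed)
    T = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
    beta = T @ y
    null = np.empty((n_perm, beta.size))
    for i in range(n_perm):
        # Shuffling y breaks its link to X; the coefficients that result
        # form the null distribution.
        null[i] = T @ y[rng.permutation(y.size)]
    return (beta - null.mean(axis=0)) / null.std(axis=0)

data_rng = np.random.default_rng(42)
X = data_rng.standard_normal((100, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.0]) + 0.1 * data_rng.standard_normal(100)

z_a = permutation_zscores(X, y, seed=0)
z_b = permutation_zscores(X, y, seed=0)  # same seed: identical permutations
z_c = permutation_zscores(X, y, seed=1)  # different seed: different permutations

print(np.array_equal(z_a, z_b), np.array_equal(z_a, z_c))
```

Two runs agree exactly only when they draw the same permutations, which is why matching R output requires the same underlying generator and not just the same seed value.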

Requirements

Python ≥ 3.9
NumPy ≥ 1.20
Pandas ≥ 1.3
SciPy ≥ 1.7
h5py ≥ 3.0
anndata ≥ 0.8
scanpy ≥ 1.9

Optional: CuPy ≥ 10.0 (GPU acceleration)

Citation

If you use SecActPy in your research, please cite:

Beibei Ru, Lanqi Gong, Emily Yang, Seongyong Park, George Zaki, Kenneth Aldape, Lalage Wakefield, Peng Jiang. Inference of secreted protein activities in intercellular communication. Nature Methods, 2026 (In press)

Related Projects

SecAct - Original R implementation
RidgeR - R ridge regression package
SpaCET - Spatial transcriptomics cell type analysis
CytoSig - Cytokine signaling inference

License

MIT License - see LICENSE for details.

Changelog

See CHANGELOG.md for full version history.

v0.2.5

Streaming H5AD: streaming=True for two-pass chunk reading of >5M-cell datasets (~3 GB peak)
H5ADChunkReader for memory-efficient H5AD reading via h5py
Fixed H5AD index column detection for obs.attrs['_index'] convention

v0.2.4

col_center and col_scale parameters for independent control of sparse in-flight normalization

v0.2.3

rng_method parameter for explicit RNG selection
is_group_sig=True by default
