GEX STARsolo throughput doesn't scale with threads (flat 16→64; slower than upstream STAR 2.7.11b)

Hey! I was benchmarking STAR-suite as a drop-in replacement for upstream STAR on plain 10x GEX data and ran into something I can't explain: **STAR-suite's STARsolo throughput barely changes when I give it more threads**, while upstream STAR 2.7.11b scales fine on the same machine. Wanted to check whether this is expected or whether I'm holding it wrong.

## Setup

- STAR-suite **v1.3.0b** (tag `v1.3.0b`, commit `975b23a`), built from source: `make core WITH_CHROMAP=0` (gcc/g++ on Debian bookworm, the Makefile's default `-O3`, no `-march=native`).
- Upstream **STAR 2.7.11b** = the official `Linux_x86_64_static` release binary.
- **Both binaries in the same container, same node, same index, same reads**, so the only variables are the binary and the thread count.
- Data: 10x **pbmc_10k_v3** (~600M reads, 3' v3). Reference: GRCh38 (CellRanger refdata-gex-2020-A genome+GTF, fresh index built with STAR 2.7.11b). Node: 184-core / 1 TB RAM.
- Counting: `--soloFeatures GeneFull`, EmptyDrops_CR, the usual CR-ish flag set (full command below).

## The numbers (STARsolo mapping speed, from `Log.final.out`)

| threads | upstream STAR 2.7.11b | STAR-suite (tuned*) |
|---|---|---|
| 16 | 991 M reads/hr | 741 M reads/hr |
| 64 | **1431 M reads/hr** | 759 M reads/hr |

\* "tuned" = `--dynamicThreadInterface 1 --dynamicThreadConstMapPermits <threads>`. Without those flags it's basically the same (745 M reads/hr @ 64t), so the dynamic-thread flags didn't move the needle for me.

So upstream gets ~1.44x faster going 16→64 threads (1431 / 991), still well short of the ideal 4x, but STAR-suite stays flat (~740–760 M reads/hr); it's as if the extra cores aren't being used. Wall clock matches the story: upstream 39m→27m, STAR-suite ~56m→55m. STAR-suite also used ~1.6x the peak RAM (~62 GB vs ~39 GB at 64t).
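For anyone who wants the scaling as a single number, here's a minimal sketch that turns the table values into speedup and parallel efficiency (speedup divided by the thread ratio). The `scaling` helper is just something I wrote for illustration, not part of STAR or STAR-suite; with 4x the threads, an efficiency of 25% means no speedup at all.

```shell
# scaling THREADS_LO THREADS_HI SPEED_LO SPEED_HI
# Hypothetical helper: prints speedup and parallel efficiency,
# where efficiency = speedup / (THREADS_HI / THREADS_LO).
scaling() {
  awk -v t1="$1" -v t2="$2" -v s1="$3" -v s2="$4" 'BEGIN {
    speedup = s2 / s1
    printf "speedup=%.2fx efficiency=%.0f%%\n", speedup, 100 * speedup / (t2 / t1)
  }'
}

# Speeds in M reads/hr, taken from the table above.
scaling 16 64 991 1431   # upstream STAR 2.7.11b -> speedup=1.44x efficiency=36%
scaling 16 64 741 759    # STAR-suite (tuned)    -> speedup=1.02x efficiency=26%
```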

To be clear, **correctness is great** — the count matrices are essentially identical to upstream (per-gene Pearson 0.9998, barcode Jaccard ~0.97, <0.2% of nonzero entries differ), and output is bit-identical across thread counts. This is purely a throughput/threading thing.
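In case it helps to reproduce the barcode comparison, this is roughly how a Jaccard index over two barcode lists can be computed (size of the intersection over size of the union). It's a sketch rather than the exact script behind the numbers above; the `jaccard` helper and the run directory names in the usage comment are placeholders.

```shell
# jaccard FILE_A FILE_B
# Hypothetical helper: Jaccard index of two line sets, |A ∩ B| / |A ∪ B|.
jaccard() {
  a=$(mktemp) && b=$(mktemp) || return 1
  # comm needs both inputs sorted under the same collation, hence LC_ALL=C.
  LC_ALL=C sort -u "$1" > "$a"
  LC_ALL=C sort -u "$2" > "$b"
  inter=$(LC_ALL=C comm -12 "$a" "$b" | wc -l)   # lines present in both files
  union=$(LC_ALL=C sort -u "$a" "$b" | wc -l)    # distinct lines overall
  rm -f "$a" "$b"
  awk -v i="$inter" -v u="$union" 'BEGIN {
    if (u == 0) { print "NA"; exit }             # both inputs empty
    printf "%.4f\n", i / u
  }'
}

# Usage (run directories are placeholders):
#   jaccard upstream_64t/Solo.out/GeneFull/filtered/barcodes.tsv \
#           suite_64t/Solo.out/GeneFull/filtered/barcodes.tsv
```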

## The exact mapping command

```
STAR --runThreadN <T> --genomeDir <GRCh38_index> \
  --readFilesIn R2_L001,R2_L002 R1_L001,R1_L002 --readFilesCommand zcat \
  --soloType CB_UMI_Simple --soloCBwhitelist 3M-february-2018.txt \
  --soloCBstart 1 --soloCBlen 16 --soloUMIstart 17 --soloUMIlen 12 --soloBarcodeReadLength 0 \
  --soloFeatures GeneFull --soloStrand Forward \
  --clipAdapterType CellRanger4 --outFilterScoreMin 30 \
  --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts \
  --soloUMIfiltering MultiGeneUMI_CR --soloUMIdedup 1MM_CR \
  --soloCellFilter EmptyDrops_CR --outSAMtype None \
  --dynamicThreadInterface 1 --dynamicThreadConstMapPermits <T> \
  --outFileNamePrefix <out>/
```
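The speeds in the table are read from the "Mapping speed, Million of reads per hour" line that STAR writes to `Log.final.out`. A small sketch for pulling that value out of each run; the `mapping_speed` helper and the directory names in the usage comment are mine, not anything shipped with STAR.

```shell
# mapping_speed LOG
# Hypothetical helper: prints the mapping speed (M reads/hr) from a Log.final.out.
mapping_speed() {
  awk -F'|' '/Mapping speed, Million of reads per hour/ {
    gsub(/[ \t]/, "", $2)   # strip the padding around the value
    print $2
  }' "$1"
}

# Usage (run directories are placeholders):
#   for t in 16 64; do
#     printf '%s threads: %s M reads/hr\n' "$t" "$(mapping_speed "out_${t}t/Log.final.out")"
#   done
```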

## Questions

1. Is flat thread-scaling expected for plain GEX STARsolo, or should it scale like upstream? (I get that the headline speedups are for perturb-seq / the CR-compat path — just trying to understand the GEX case.)
2. Am I missing a flag to actually get the parallel speedup? Is `--dynamicThreadConstMapPermits <threads>` the right value, or should it be set differently relative to `--runThreadN`?
3. Could this be my from-source build (default `-O3`, no `-march=native`, `WITH_CHROMAP=0`) rather than the algorithm? Is there a recommended/optimized build recipe (or an official prebuilt binary) I should use for a fair speed comparison?

Happy to share full logs / `Log.final.out` for any of the runs. Thanks!

