Hey! I was benchmarking STAR-suite as a drop-in replacement for upstream STAR on plain 10x GEX data and ran into something I can't explain: STAR-suite's STARsolo throughput barely changes when I give it more threads, while upstream STAR 2.7.11b scales fine on the same machine. Wanted to check whether this is expected or whether I'm holding it wrong.
Setup
- STAR-suite v1.3.0b (tag v1.3.0b, commit 975b23a), built from source: make core WITH_CHROMAP=0 (gcc/g++ on Debian bookworm, the Makefile's default -O3, no -march=native).
- Upstream STAR 2.7.11b = the official Linux_x86_64_static release binary.
- Both binaries in the same container, same node, same index, same reads (so the only variable is the binary + thread count).
- Data: 10x pbmc_10k_v3 (~600M reads, 3' v3). Reference: GRCh38 (CellRanger refdata-gex-2020-A genome+GTF, fresh index built with STAR 2.7.11b). Node: 184-core / 1 TB RAM.
- Counting: --soloFeatures GeneFull, EmptyDrops_CR, the usual CR-ish flag set (full command below).
The numbers (STARsolo mapping speed, from Log.final.out)
| threads | upstream STAR 2.7.11b | STAR-suite (tuned*) |
|---|---|---|
| 16 | 991 M reads/hr | 741 M reads/hr |
| 64 | 1431 M reads/hr | 759 M reads/hr |
* "tuned" = --dynamicThreadInterface 1 --dynamicThreadConstMapPermits <threads>. Without those flags it's basically the same (745 M reads/hr @ 64t), so the dynamic-thread flags didn't move the needle for me.
So upstream gets ~1.45x faster going 16→64 threads, but STAR-suite stays flat (~740–760) — it's like the extra cores aren't being used. Wall clock matches the story: upstream 39m→27m, STAR-suite ~56m→55m. STAR-suite also used ~1.6x the peak RAM (~62 GB vs ~39 GB at 64t).
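For anyone who wants to redo the arithmetic, here is a quick sketch of how the speedup and parallel efficiency fall out of the table. The helper name `scaling` is made up for this post, and "efficiency" here just means measured speedup divided by the ideal 4x from quadrupling the thread count.

```shell
# Print "<speedup> <efficiency %>" for a pair of runs.
# Args: threads_lo speed_lo threads_hi speed_hi (speeds in M reads/hr)
scaling() {
  awk -v t1="$1" -v s1="$2" -v t2="$3" -v s2="$4" 'BEGIN {
    speedup = s2 / s1          # measured throughput ratio
    ideal   = t2 / t1          # perfect linear scaling
    printf "%.2f %.0f\n", speedup, 100 * speedup / ideal
  }'
}

scaling 16 991 64 1431   # upstream STAR 2.7.11b -> 1.44 36
scaling 16 741 64 759    # STAR-suite (tuned)    -> 1.02 26
```

So even upstream is far from linear at 64 threads, but STAR-suite gains almost nothing from the extra 48 threads.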
To be clear, correctness is great — the count matrices are essentially identical to upstream (per-gene Pearson 0.9998, barcode Jaccard ~0.97, <0.2% of nonzero entries differ), and output is bit-identical across thread counts. This is purely a throughput/threading thing.
The exact mapping command
STAR --runThreadN <T> --genomeDir <GRCh38_index> \
--readFilesIn R2_L001,R2_L002 R1_L001,R1_L002 --readFilesCommand zcat \
--soloType CB_UMI_Simple --soloCBwhitelist 3M-february-2018.txt \
--soloCBstart 1 --soloCBlen 16 --soloUMIstart 17 --soloUMIlen 12 --soloBarcodeReadLength 0 \
--soloFeatures GeneFull --soloStrand Forward \
--clipAdapterType CellRanger4 --outFilterScoreMin 30 \
--soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts \
--soloUMIfiltering MultiGeneUMI_CR --soloUMIdedup 1MM_CR \
--soloCellFilter EmptyDrops_CR --outSAMtype None \
--dynamicThreadInterface 1 --dynamicThreadConstMapPermits <T> \
--outFileNamePrefix <out>/
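The speeds in the table were read off the "Mapping speed" line of each run's Log.final.out. A minimal sketch of pulling that value out, assuming the usual `key | value` layout of that file (the function name `mapping_speed` and the example path are placeholders of mine, not anything STAR provides):

```shell
# Print the mapping speed (million reads per hour) from a Log.final.out file.
# Assumes a line of the form:
#   "Mapping speed, Million of reads per hour |<TAB>991.00"
mapping_speed() {
  awk -F'|' '/Mapping speed, Million of reads per hour/ {
    gsub(/[ \t]/, "", $2)   # strip the padding around the value
    print $2
  }' "$1"
}

# Example (hypothetical output directory):
#   mapping_speed out_64t/Log.final.out
```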
Questions
- Is flat thread-scaling expected for plain GEX STARsolo, or should it scale like upstream? (I get that the headline speedups are for perturb-seq / the CR-compat path — just trying to understand the GEX case.)
- Am I missing a flag to actually get the parallel speedup? Is --dynamicThreadConstMapPermits <threads> the right value, or should it be set differently relative to --runThreadN?
- Could this be my from-source build (default -O3, no -march=native, WITH_CHROMAP=0) rather than the algorithm? Is there a recommended/optimized build recipe (or an official prebuilt binary) I should use for a fair speed comparison?
Happy to share full logs / Log.final.out for any of the runs. Thanks!