GEX STARsolo throughput doesn't scale with threads (flat 16→64; slower than upstream STAR 2.7.11b) #3

@nick-youngblut

Hey! I was benchmarking STAR-suite as a drop-in replacement for upstream STAR on plain 10x GEX data and ran into something I can't explain: STAR-suite's STARsolo throughput barely changes when I give it more threads, while upstream STAR 2.7.11b scales fine on the same machine. Wanted to check whether this is expected or whether I'm holding it wrong.

Setup

  • STAR-suite v1.3.0b (tag v1.3.0b, commit 975b23a), built from source: make core WITH_CHROMAP=0 (gcc/g++ on Debian bookworm, the Makefile's default -O3, no -march=native).
  • Upstream STAR 2.7.11b = the official Linux_x86_64_static release binary.
  • Both binaries in the same container, same node, same index, same reads (so the only variable is the binary + thread count).
  • Data: 10x pbmc_10k_v3 (~600M reads, 3' v3). Reference: GRCh38 (CellRanger refdata-gex-2020-A genome+GTF, fresh index built with STAR 2.7.11b). Node: 184-core / 1 TB RAM.
  • Counting: --soloFeatures GeneFull, --soloCellFilter EmptyDrops_CR, plus the usual CellRanger-style flag set (full command below).

The numbers (STARsolo mapping speed, from Log.final.out)

threads   upstream STAR 2.7.11b   STAR-suite (tuned*)
16        991 M reads/hr          741 M reads/hr
64        1431 M reads/hr         759 M reads/hr

* "tuned" = --dynamicThreadInterface 1 --dynamicThreadConstMapPermits <threads>. Without those flags it's basically the same (745 M reads/hr @ 64t), so the dynamic-thread flags didn't move the needle for me.
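
For anyone reproducing the table, here is a minimal Python sketch of pulling the mapping speed out of a Log.final.out. It assumes the file reports the value on a line of the form "Mapping speed, Million of reads per hour | <value>"; the helper name mapping_speed and the sample text are illustrative, not taken from my actual logs.

```python
def mapping_speed(log_text):
    """Return the mapping speed (M reads/hr) from the text of a Log.final.out.

    Assumes a 'label | value' layout, with the label starting with
    'Mapping speed'. Raises ValueError if no such line is present.
    """
    for line in log_text.splitlines():
        label, sep, value = line.partition("|")
        if sep and label.strip().startswith("Mapping speed"):
            return float(value.strip())
    raise ValueError("no 'Mapping speed' line found")


# Illustrative sample only (not a real log excerpt).
sample = (
    "        Number of input reads |\t1000\n"
    "        Mapping speed, Million of reads per hour |\t991.00\n"
)
print(mapping_speed(sample))
```

In practice the text would come from open("<out>/Log.final.out").read() for each run.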

So upstream gets ~1.44x faster going 16→64 threads (991→1431 M reads/hr), but STAR-suite stays flat (741→759 M reads/hr, about 2%), as if the extra cores aren't being used. Wall clock matches the story: upstream 39m→27m, STAR-suite ~56m→55m. STAR-suite also used ~1.6x the peak RAM (~62 GB vs ~39 GB at 64t).
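
To make the scaling comparison concrete, here is a small Python sketch that turns the numbers in the table into a speedup and a parallel efficiency (speedup divided by the ideal 4x from 16→64 threads). The function name scaling is my own; the rates are the ones reported above.

```python
def scaling(rate_lo, rate_hi, threads_lo, threads_hi):
    """Return (speedup, parallel efficiency) between two thread counts.

    Efficiency is the observed speedup divided by the ideal speedup
    (threads_hi / threads_lo), so 1.0 would mean perfect linear scaling.
    """
    speedup = rate_hi / rate_lo
    ideal = threads_hi / threads_lo
    return speedup, speedup / ideal


# Mapping speeds in M reads/hr at 16 and 64 threads, from the table above.
rates = {
    "upstream STAR 2.7.11b": (991, 1431),
    "STAR-suite (tuned)": (741, 759),
}

for name, (r16, r64) in rates.items():
    speedup, efficiency = scaling(r16, r64, 16, 64)
    print(f"{name}: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Neither binary is close to linear here, but upstream still gains noticeably from the extra threads while STAR-suite gains almost nothing.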

To be clear, correctness is great — the count matrices are essentially identical to upstream (per-gene Pearson 0.9998, barcode Jaccard ~0.97, <0.2% of nonzero entries differ), and output is bit-identical across thread counts. This is purely a throughput/threading thing.
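
For context on what those concordance metrics mean, here is a minimal sketch of one way to compute them. It is an assumption about the definitions, not the exact script I ran: "barcode Jaccard" is taken as the overlap of the called barcode sets, and "per-gene Pearson" as the correlation of per-gene total counts. Matrices are represented as plain dicts keyed by (gene, barcode), and all names and toy values are hypothetical.

```python
import math


def barcode_jaccard(a, b):
    """Jaccard index of the barcode sets of two {(gene, barcode): count} dicts."""
    bc_a = {bc for _, bc in a}
    bc_b = {bc for _, bc in b}
    return len(bc_a & bc_b) / len(bc_a | bc_b)


def gene_totals(m, genes):
    """Sum counts per gene, in the order given by `genes`."""
    totals = dict.fromkeys(genes, 0)
    for (gene, _), count in m.items():
        totals[gene] += count
    return [totals[g] for g in genes]


def per_gene_pearson(a, b):
    """Pearson correlation of per-gene total counts between two matrices."""
    genes = sorted({g for g, _ in a} | {g for g, _ in b})
    x, y = gene_totals(a, genes), gene_totals(b, genes)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    var_y = sum((yi - my) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)


# Toy matrices, made up for illustration.
upstream = {("G1", "AAA"): 5, ("G2", "AAA"): 1, ("G1", "CCC"): 3}
suite = {("G1", "AAA"): 5, ("G2", "AAA"): 1, ("G1", "TTT"): 2}

print(barcode_jaccard(upstream, suite))
print(per_gene_pearson(upstream, suite))
```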

The exact mapping command

STAR --runThreadN <T> --genomeDir <GRCh38_index> \
  --readFilesIn R2_L001,R2_L002 R1_L001,R1_L002 --readFilesCommand zcat \
  --soloType CB_UMI_Simple --soloCBwhitelist 3M-february-2018.txt \
  --soloCBstart 1 --soloCBlen 16 --soloUMIstart 17 --soloUMIlen 12 --soloBarcodeReadLength 0 \
  --soloFeatures GeneFull --soloStrand Forward \
  --clipAdapterType CellRanger4 --outFilterScoreMin 30 \
  --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts \
  --soloUMIfiltering MultiGeneUMI_CR --soloUMIdedup 1MM_CR \
  --soloCellFilter EmptyDrops_CR --outSAMtype None \
  --dynamicThreadInterface 1 --dynamicThreadConstMapPermits <T> \
  --outFileNamePrefix <out>/
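
To compare runs with and without the dynamic-thread flags at each thread count, the command above can be generated programmatically. This is a sketch of a hypothetical helper (star_cmd is my own name, and the genome and output paths are placeholders); it only assembles the argument list using the flags from the command above and does not run anything.

```python
import shlex


def star_cmd(threads, genome_dir, out_prefix, dynamic=True, binary="STAR"):
    """Build the STARsolo argument list from the command above.

    `dynamic` toggles the two --dynamicThread* flags so the same helper
    covers both the "tuned" and the untuned runs.
    """
    cmd = [
        binary, "--runThreadN", str(threads), "--genomeDir", genome_dir,
        "--readFilesIn", "R2_L001,R2_L002", "R1_L001,R1_L002",
        "--readFilesCommand", "zcat",
        "--soloType", "CB_UMI_Simple",
        "--soloCBwhitelist", "3M-february-2018.txt",
        "--soloCBstart", "1", "--soloCBlen", "16",
        "--soloUMIstart", "17", "--soloUMIlen", "12",
        "--soloBarcodeReadLength", "0",
        "--soloFeatures", "GeneFull", "--soloStrand", "Forward",
        "--clipAdapterType", "CellRanger4", "--outFilterScoreMin", "30",
        "--soloCBmatchWLtype", "1MM_multi_Nbase_pseudocounts",
        "--soloUMIfiltering", "MultiGeneUMI_CR", "--soloUMIdedup", "1MM_CR",
        "--soloCellFilter", "EmptyDrops_CR", "--outSAMtype", "None",
    ]
    if dynamic:
        cmd += ["--dynamicThreadInterface", "1",
                "--dynamicThreadConstMapPermits", str(threads)]
    cmd += ["--outFileNamePrefix", out_prefix]
    return cmd


# Placeholder paths; print one command line per (threads, dynamic) combination.
for threads in (16, 64):
    for dynamic in (True, False):
        out = f"out_t{threads}_dyn{int(dynamic)}/"
        print(shlex.join(star_cmd(threads, "GRCh38_index", out, dynamic)))
```

Each printed line can be passed to a shell or to subprocess.run as the list form, with the binary argument switched between the two STAR builds.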

Questions

  1. Is flat thread-scaling expected for plain GEX STARsolo, or should it scale like upstream? (I get that the headline speedups are for perturb-seq / the CR-compat path — just trying to understand the GEX case.)
  2. Am I missing a flag to actually get the parallel speedup? Is --dynamicThreadConstMapPermits <threads> the right value, or should it be set differently relative to --runThreadN?
  3. Could this be my from-source build (default -O3, no -march=native, WITH_CHROMAP=0) rather than the algorithm? Is there a recommended/optimized build recipe (or an official prebuilt binary) I should use for a fair speed comparison?

Happy to share full logs / Log.final.out for any of the runs. Thanks!
