Skip to content

bcf_sr_set_regions: opt-in fastpath for dense single-base BEDs (samtools/bcftools#2557)#2011

Open
carstenerickson wants to merge 3 commits into
samtools:developfrom
carstenerickson:fix/view-R-region-coalescing-2557
Open

bcf_sr_set_regions: opt-in fastpath for dense single-base BEDs (samtools/bcftools#2557)#2011
carstenerickson wants to merge 3 commits into
samtools:developfrom
carstenerickson:fix/view-R-region-coalescing-2557

Conversation

@carstenerickson
Copy link
Copy Markdown

@carstenerickson carstenerickson commented May 16, 2026

Summary

bcf_sr_set_regions(file=1) runs one tbx_itr_queryi() per BED entry, which is 300-500× slower than a sequential scan at SNP-panel sizes — samtools/bcftools#2557 was a production case where 84M single-base regions against a 23 GB VCF did not complete in 11+ hours of 100% CPU. The streaming-targets path already handles this workload at near-baseline speed; this PR adds auto-routing to it.

Design

New BCF_SR_AUTO_TARGETS_FROM_REGIONS opt (end of bcf_sr_opt_t, ABI-safe; default off). When set, bcf_sr_set_regions() runs a sniffer that accepts only when ALL hold:

  • The path is a regular local file (FIFO/stdin/URL skip the sniff so we never consume the prefix before bcf_sr_regions_init() reopens).
  • First 256 non-comment entries are all single-base (END − START == 1).
  • Average intra-chromosome inter-entry distance < 10 kbp.

On accept, regions_overlap is copied to targets_overlap (value semantics are identical) and bcf_sr_set_targets() takes over. On reject (sparse / wide / malformed / too small / non-local), the existing seek-per-region path runs unchanged.

The opt-in gate is the safety mechanism: successful promotion populates readers->targets, so a subsequent bcf_sr_set_targets() fails — callers must explicitly accept that trade-off.

Measurements

bcftools 1.23.1 + this PR; single-base BED at 10 bp avg spacing, matching VCF; macOS arm64:

N seek-per-region fastpath speedup
100K 13.3 s 0.19 s 70×
1M 156.6 s 0.55 s 285×
5M 705 s 2.76 s 255×

Production (84M entries × 23 GB VCF) extrapolates from hours to minutes.

Test plan

New test_bcf_sr_regions_fastpath (test/test.pl) diffs three genuinely different code paths producing identical output on a 300-entry single-base fixture:

  • -R FILE slow path (opt off)
  • -R FILE --auto-targets-from-regions fastpath (opt + sniffer)
  • -T FILE existing streaming-targets path

test-bcf-sr gains matching -R/-T file-mode flags and the --auto-targets-from-regions flag. All 359 existing htslib tests pass.

Notes

The matching bcftools-side wiring is at samtools/bcftools#2561; without it, bcftools view -R FILE keeps the slow path. This PR is strictly the htslib half.

Tracks samtools/bcftools#2557.

@carstenerickson carstenerickson changed the title bcf_sr_set_regions: opt-in fastpath for dense single-base BEDs (#2557) bcf_sr_set_regions: opt-in fastpath for dense single-base BEDs (samtools/bcftools#2557) May 16, 2026
@carstenerickson carstenerickson marked this pull request as ready for review May 16, 2026 19:21
bcftools view -R FILE.bed.gz exhibits 300-500x slowdown vs a sequential
scan when the BED contains many densely-clustered single-base regions
-- a common pattern in PRS / ancestry pipelines (HGDP+1kGP 84M, AADR
1240k, PGS Catalog). A production run on 84M such regions against a
23 GB VCF did not complete in 11+ hours of 100% CPU.

Root cause: bcf_sr_set_regions(is_file=1) routes through
_readers_next_region() which issues one tbx_itr_queryi() per BED entry
-- 84M iterator setup/teardown cycles for an 84M-entry panel.

This patch adds a sniffer in bcf_sr_set_regions(): when the BED's
first 256 non-comment entries are all single-base regions, internally
promote to bcf_sr_set_targets(), which streams the VCF top-to-bottom
and filters via in-memory regidx. Existing well-trodden code path;
no new hot-path logic. The 0/1/2 --regions-overlap semantics match
--targets-overlap so the user value carries over unchanged.

Measured speedup against a synthetic reproducer (single-base BED at
~10bp avg spacing, matching VCF; bcftools 1.23.1, macOS arm64):

  N=100K: 13.3s  -> 0.19s   (70x)
  N=1M  : 156.6s -> 0.55s  (285x)
  N=5M  : 705s   -> 2.76s  (255x)

The 84M production case extrapolates from hours to minutes.

Regression test: test_bcf_sr_regions_fastpath asserts -R FILE and
-T FILE produce identical output on a 300-entry single-base fixture
(>256 triggers the sniffer). test-bcf-sr gains -R/-T file-mode flags
to exercise bcf_sr_set_regions/set_targets is_file=1 paths.

Tracks samtools/bcftools#2557.
Four follow-ups to the auto-promotion patch:

1. API state leak (normal). The fastpath called bcf_sr_set_targets()
   unconditionally, which populates readers->targets while leaving
   readers->regions NULL and explicit_regs=0 -- both observable to
   callers. A subsequent bcf_sr_set_targets() then fails (e.g. for the
   -R FILE -T OTHER workflow); downstream consumers inspecting
   explicit_regs/regions saw "no regions set" even though set_regions
   succeeded. Whether identical caller code worked depended on the BED
   contents (the sniffer's content-dependent dispatch).

   Fix: gate the fastpath behind a new BCF_SR_AUTO_TARGETS_FROM_REGIONS
   opt. Default off, so existing callers keep the documented
   bcf_sr_set_regions() semantics. Callers that opt in accept the
   incompatibility with bcf_sr_set_targets() -- documented at the enum.

2. Sparse-BED regression (normal). The lone 256-line count threshold
   accepted 256 SNPs scattered genome-wide, which (compounded with samtools#1's
   explicit_regs=0) flipped ~256 cheap tabix seeks into full per-chrom
   streaming scans on whole-genome BCFs. Sniffer now also requires
   average intra-chromosome inter-entry distance below 10 kbp;
   84M/1240k panels pass, scattered 256-entry BEDs fail.

3. FIFO consume-prefix (nit). The sniffer's hts_open + read + close
   permanently drained the first 256 lines from unseekable inputs.
   Sniffer now gates on hisremote() / '-' / stat()+S_ISREG() before
   opening, so FIFOs, stdin and char devices skip the sniff entirely.
   URLs are also skipped to avoid an extra round-trip.

4. Tautological regression test (nit). The prior test diffed -R against
   -T, but the fastpath made -R literally call bcf_sr_set_targets() --
   the same function -T invokes -- so a defect in set_targets would
   shift both sides identically and the diff would still pass. Test
   now diffs three genuinely different code paths: -R slow path
   (per-region tbx_itr_queryi), -R fastpath (sniff + set_targets via
   opt), and -T direct. test-bcf-sr gains --auto-targets-from-regions
   to drive the new opt.

Tracks samtools/bcftools#2557.
@carstenerickson carstenerickson force-pushed the fix/view-R-region-coalescing-2557 branch from 142224b to deec4fa Compare May 17, 2026 19:01
Matches the style established by 6e5a94b ("Log error when hts_close()
fails in synced reader cleanup"), which is the most recent contribution
to this file. Our sniffer is read-only and local-files-only (we gate on
S_ISREG before opening), so close errors are exceedingly rare in practice
-- but the cost is two lines and the precedent is in this same file.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant