bcf_sr_set_regions: opt-in fastpath for dense single-base BEDs (samtools/bcftools#2557)#2011
Open
carstenerickson wants to merge 3 commits into
Open
Conversation
bcf_sr_set_regions: opt-in fastpath for dense single-base BEDs (#2557)bcf_sr_set_regions: opt-in fastpath for dense single-base BEDs (samtools/bcftools#2557)
bcftools view -R FILE.bed.gz exhibits 300-500x slowdown vs a sequential scan when the BED contains many densely-clustered single-base regions -- a common pattern in PRS / ancestry pipelines (HGDP+1kGP 84M, AADR 1240k, PGS Catalog). A production run on 84M such regions against a 23 GB VCF did not complete in 11+ hours of 100% CPU. Root cause: bcf_sr_set_regions(is_file=1) routes through _readers_next_region() which issues one tbx_itr_queryi() per BED entry -- 84M iterator setup/teardown cycles for an 84M-entry panel. This patch adds a sniffer in bcf_sr_set_regions(): when the BED's first 256 non-comment entries are all single-base regions, internally promote to bcf_sr_set_targets(), which streams the VCF top-to-bottom and filters via in-memory regidx. Existing well-trodden code path; no new hot-path logic. The 0/1/2 --regions-overlap semantics match --targets-overlap so the user value carries over unchanged. Measured speedup against a synthetic reproducer (single-base BED at ~10bp avg spacing, matching VCF; bcftools 1.23.1, macOS arm64): N=100K: 13.3s -> 0.19s (70x) N=1M : 156.6s -> 0.55s (285x) N=5M : 705s -> 2.76s (255x) The 84M production case extrapolates from hours to minutes. Regression test: test_bcf_sr_regions_fastpath asserts -R FILE and -T FILE produce identical output on a 300-entry single-base fixture (>256 triggers the sniffer). test-bcf-sr gains -R/-T file-mode flags to exercise bcf_sr_set_regions/set_targets is_file=1 paths. Tracks samtools/bcftools#2557.
Four follow-ups to the auto-promotion patch: 1. API state leak (normal). The fastpath called bcf_sr_set_targets() unconditionally, which populates readers->targets while leaving readers->regions NULL and explicit_regs=0 -- both observable to callers. A subsequent bcf_sr_set_targets() then fails (e.g. for the -R FILE -T OTHER workflow); downstream consumers inspecting explicit_regs/regions saw "no regions set" even though set_regions succeeded. Whether identical caller code worked depended on the BED contents (the sniffer's content-dependent dispatch). Fix: gate the fastpath behind a new BCF_SR_AUTO_TARGETS_FROM_REGIONS opt. Default off, so existing callers keep the documented bcf_sr_set_regions() semantics. Callers that opt in accept the incompatibility with bcf_sr_set_targets() -- documented at the enum. 2. Sparse-BED regression (normal). The lone 256-line count threshold accepted 256 SNPs scattered genome-wide, which (compounded with samtools#1's explicit_regs=0) flipped ~256 cheap tabix seeks into full per-chrom streaming scans on whole-genome BCFs. Sniffer now also requires average intra-chromosome inter-entry distance below 10 kbp; 84M/1240k panels pass, scattered 256-entry BEDs fail. 3. FIFO consume-prefix (nit). The sniffer's hts_open + read + close permanently drained the first 256 lines from unseekable inputs. Sniffer now gates on hisremote() / '-' / stat()+S_ISREG() before opening, so FIFOs, stdin and char devices skip the sniff entirely. URLs are also skipped to avoid an extra round-trip. 4. Tautological regression test (nit). The prior test diffed -R against -T, but the fastpath made -R literally call bcf_sr_set_targets() -- the same function -T invokes -- so a defect in set_targets would shift both sides identically and the diff would still pass. Test now diffs three genuinely different code paths: -R slow path (per-region tbx_itr_queryi), -R fastpath (sniff + set_targets via opt), and -T direct. test-bcf-sr gains --auto-targets-from-regions to drive the new opt. Tracks samtools/bcftools#2557.
142224b to
deec4fa
Compare
Matches the style established by 6e5a94b ("Log error when hts_close() fails in synced reader cleanup"), which is the most recent contribution to this file. Our sniffer is read-only and local-files-only (we gate on S_ISREG before opening), so close errors are exceedingly rare in practice -- but the cost is two lines and the precedent is in this same file.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
bcf_sr_set_regions(file=1)runs onetbx_itr_queryi()per BED entry, which is 300-500× slower than a sequential scan at SNP-panel sizes — samtools/bcftools#2557 was a production case where 84M single-base regions against a 23 GB VCF did not complete in 11+ hours of 100% CPU. The streaming-targets path already handles this workload at near-baseline speed; this PR adds auto-routing to it.Design
New
BCF_SR_AUTO_TARGETS_FROM_REGIONSopt (end ofbcf_sr_opt_t, ABI-safe; default off). When set,bcf_sr_set_regions()runs a sniffer that accepts only when ALL hold:bcf_sr_regions_init()reopens).END − START == 1).On accept,
regions_overlapis copied totargets_overlap(value semantics are identical) andbcf_sr_set_targets()takes over. On reject (sparse / wide / malformed / too small / non-local), the existing seek-per-region path runs unchanged.The opt-in gate is the safety mechanism: successful promotion populates
readers->targets, so a subsequentbcf_sr_set_targets()fails — callers must explicitly accept that trade-off.Measurements
bcftools 1.23.1 + this PR; single-base BED at 10 bp avg spacing, matching VCF; macOS arm64:
Production (84M entries × 23 GB VCF) extrapolates from hours to minutes.
Test plan
New
test_bcf_sr_regions_fastpath(test/test.pl) diffs three genuinely different code paths producing identical output on a 300-entry single-base fixture:-R FILEslow path (opt off)-R FILE --auto-targets-from-regionsfastpath (opt + sniffer)-T FILEexisting streaming-targets pathtest-bcf-srgains matching-R/-Tfile-mode flags and the--auto-targets-from-regionsflag. All 359 existing htslib tests pass.Notes
The matching bcftools-side wiring is at samtools/bcftools#2561; without it,
bcftools view -R FILEkeeps the slow path. This PR is strictly the htslib half.Tracks samtools/bcftools#2557.