
view: opt into htslib dense single-base BED auto-promotion (#2557) #2561

Open
carstenerickson wants to merge 1 commit into samtools:develop from carstenerickson:fix/view-R-region-coalescing-2557


@carstenerickson

Summary

Wires bcftools view -R FILE into the new BCF_SR_AUTO_TARGETS_FROM_REGIONS htslib opt (samtools/htslib#2011). When -R FILE is given and no -t/-T is given, view sets the opt before calling bcf_sr_set_regions(); htslib's sniffer then decides per file whether to promote a dense single-base BED to the streaming-targets path.

Design

| condition | behaviour |
|---|---|
| -R FILE, no -t/-T | opt on (auto) |
| -R FILE -T OTHER | opt off (avoid API state leak: sniffer would populate readers->targets, conflicting with set_targets) |
| -r REGION (string) | opt off (sniffer requires a file) |
| -R FILE --no-regions-fastpath | opt off (user escape hatch) |

--no-regions-fastpath is the user-facing escape hatch for corner cases the sniffer accepts incorrectly.
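
The gating rules above can be sketched as a single predicate. This is only an illustration of the decision table, not the code in vcfview.c: the `view_args_t` struct, its field names, and `want_auto_targets()` are hypothetical, and the file names are placeholders.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the relevant command-line state; the real
 * field names in vcfview.c may differ. */
typedef struct {
    const char *regions;      /* argument of -r/-R, or NULL if absent   */
    bool regions_is_file;     /* true for -R FILE, false for -r REGION  */
    const char *targets;      /* argument of -t/-T, or NULL if absent   */
    bool no_regions_fastpath; /* true if --no-regions-fastpath was given */
} view_args_t;

/* Returns true when view should set BCF_SR_AUTO_TARGETS_FROM_REGIONS
 * before calling bcf_sr_set_regions(). */
static bool want_auto_targets(const view_args_t *a)
{
    if (a->regions == NULL || !a->regions_is_file)
        return false;   /* the sniffer requires a file */
    if (a->targets != NULL)
        return false;   /* would conflict with bcf_sr_set_targets() */
    if (a->no_regions_fastpath)
        return false;   /* user escape hatch */
    return true;
}

/* The four rows of the design table, expressed as assertions. */
static void check_design_table(void)
{
    view_args_t r_only    = { "sites.bed", true,  NULL,        false };
    view_args_t r_and_t   = { "sites.bed", true,  "other.tsv", false };
    view_args_t r_string  = { "chr1:1-10", false, NULL,        false };
    view_args_t r_escaped = { "sites.bed", true,  NULL,        true  };

    assert(want_auto_targets(&r_only));
    assert(!want_auto_targets(&r_and_t));
    assert(!want_auto_targets(&r_string));
    assert(!want_auto_targets(&r_escaped));
}
```

Even when the predicate returns true, the opt is only a request: per the summary, htslib's sniffer still decides per file whether promotion actually happens.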

Measurements

End-to-end via bcftools view -R; single-base BED at 10 bp avg spacing matching the VCF positions; macOS arm64:

| N | upstream 1.23.1 | this PR | speedup |
|---|---|---|---|
| 100K | 11.7 s | 0.17 s | 69× |
| 1M | 156.6 s | 0.73 s | 215× |
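
For readers who want to re-derive the speedup column, the arithmetic can be sketched as below, using the unrounded timings from the commit message (11.67 s and 156.60 s). The helper names and the `run_t` struct are illustrative only and not part of the PR.

```c
#include <stdio.h>

/* One measured configuration: number of BED entries and wall-clock
 * seconds before and after the change (values from the commit message). */
typedef struct {
    const char *label;
    double n_regions;
    double before_s;
    double after_s;
} run_t;

/* Speedup rounded to the nearest integer, as reported in the table. */
static long speedup_rounded(double before_s, double after_s)
{
    return (long)(before_s / after_s + 0.5);
}

/* Average cost per BED entry in microseconds. */
static double usec_per_region(double seconds, double n_regions)
{
    return seconds * 1e6 / n_regions;
}

/* Prints one summary line per measured run. */
static void print_summary(void)
{
    const run_t runs[] = {
        { "100K", 1e5,  11.67, 0.17 },
        { "1M",   1e6, 156.60, 0.73 },
    };
    for (size_t i = 0; i < sizeof runs / sizeof runs[0]; i++) {
        const run_t *r = &runs[i];
        printf("N=%s: %ldx speedup, %.2f -> %.2f us per region\n",
               r->label,
               speedup_rounded(r->before_s, r->after_s),
               usec_per_region(r->before_s, r->n_regions),
               usec_per_region(r->after_s, r->n_regions));
    }
}
```

Seen per region, the seek-per-region path costs on the order of 100 µs or more per BED entry in these runs, while the promoted path costs about 1.7 µs per entry at 100K and under 1 µs at 1M.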

The 1M case brings production HGDP+1kGP / AADR / PGS panel intersections down from hours to minutes through bcftools view -R, the most common entry point.

Test plan

test_vcf_view_regions_fastpath (test/test.pl) asserts byte-identical output across three paths over a 300-entry single-base BED fixture:

  • view -R FILE (fastpath; opt on)
  • view -R FILE --no-regions-fastpath (slow path preserved)
  • view -T FILE (control)

Existing tests pass unchanged.

Notes

Depends on samtools/htslib#2011; without that PR, the opt has no effect and view falls back to the slow path with no semantic change.

Tracks #2557.

…ls#2557)

Wires bcftools view into the new BCF_SR_AUTO_TARGETS_FROM_REGIONS htslib
opt (see samtools/htslib for the underlying sniffer and streaming-
targets routing).  When -R FILE is given and no -T is set, view sets
the opt before calling bcf_sr_set_regions(); htslib's sniffer then
decides per-file whether to promote.

Why gate on !targets_list: BCF_SR_AUTO_TARGETS_FROM_REGIONS populates
readers->targets, which conflicts with a subsequent bcf_sr_set_targets()
call.  Always-on would silently break the -R FILE -T OTHER workflow,
so the opt is suppressed whenever a -t/-T was given.  The new
--no-regions-fastpath suppresses the opt unconditionally for users
that hit corner cases the sniffer accepts incorrectly.

End-to-end measurements (synthetic single-base BED, 10bp avg spacing,
matching VCF; bcftools 1.23.1-dirty, macOS arm64):

  N=100K:  11.67s -> 0.17s   (69x)
  N=1M  : 156.60s -> 0.73s  (215x)

The N=1M result puts the production HGDP+1kGP / AADR / PGS panel cases
within minutes instead of hours.  --no-regions-fastpath, --regions-file
without an index, and -R FILE -T OTHER all preserve the existing
seek-per-region default.

Regression test test_vcf_view_regions_fastpath asserts byte-identical
output across the fastpath, --no-regions-fastpath, and -T paths on a
300-entry single-base fixture (>= htslib SNIFF_LINES=256 so the
sniffer accepts).

Tracks samtools#2557.
carstenerickson force-pushed the fix/view-R-region-coalescing-2557 branch from b7faa5a to deb6990 on May 17, 2026 18:53
@carstenerickson
Author

CI fixes pushed (force-push, post-rebase):

  • Rebased onto current develop to resolve the NEWS conflict; my bcftools view entry now sits under ## Release a.b in the alphabetic position after +trio-dnm3. There is no other diff vs. the previous push: the auto-merged vcfview.c and test/test.pl chunks were textually clean and the regression test still passes locally.

  • The Cirrus 'BCF_SR_AUTO_TARGETS_FROM_REGIONS' undeclared errors are expected until samtools/htslib#2011 merges. This PR's vcfview.c references the new opt added there; the bcftools submodule pin on develop points at upstream htslib which doesn't yet have the enum. The natural sequence is: land htslib#2011 → bump bcftools' htslib submodule pin → merge this PR. Happy to refresh the submodule bump as part of this PR once htslib#2011 is in, or leave it to whoever does the routine submodule sync.
