Tools for managing a VCF Zarr store.
The VCF Zarr format enables efficient, scalable storage and analysis of large-scale genetic variation data.
VCF Zarr files (also known as VCZ files) can be created using bio2zarr. While ideal for one-off conversion of VCF files to Zarr, bio2zarr does not support updating a VCF Zarr with new samples.
Vczstore solves the update use case by providing the following operations:
- Create a VCZ store with a known set of variants
- Append new samples to a VCZ store
- Remove samples from a VCZ store
The append operation works by appending a VCF Zarr file to the store (which is another VCF Zarr). The file being appended is created using bio2zarr.
Note that only new samples can be added - not new variants - so the store contains a fixed set of variants that must be known before it is created. This is usually not a limitation since the samples come from the same genotype array, or even from the same reference panel for imputed data.
If the source VCFs have different (but typically overlapping) sets of variants, then they need to be harmonised with the full set of variants before being converted to VCZ. This can be accomplished by running vczstore normalise before running append.
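One way to picture the harmonisation step is as re-indexing each source's records onto the store's full variant list, with missing genotypes filling the gaps. The following is a minimal sketch in plain Python, not the `vczstore normalise` implementation; the function name `harmonise`, the `(contig, pos, ref, alt)` key layout, and the use of `None` for a missing call are all assumptions made for illustration.

```python
# Illustrative only: variants are keyed by (contig, position, ref, alt).
MISSING = None  # placeholder for a missing genotype call (assumption)


def harmonise(store_variants, source_calls):
    """Return one genotype per store variant, in store order.

    store_variants: list of (contig, pos, ref, alt) keys defining the store.
    source_calls:   dict mapping a variant key to a genotype for one sample.
    """
    unknown = set(source_calls) - set(store_variants)
    if unknown:
        # New variants cannot be added to a store, only new samples.
        raise ValueError(f"variants not in store: {sorted(unknown)}")
    return [source_calls.get(key, MISSING) for key in store_variants]


store_variants = [
    ("chr1", 100, "A", ("G",)),
    ("chr1", 200, "C", ("T",)),
    ("chr2", 50, "G", ("A",)),
]
source_calls = {
    ("chr1", 100, "A", ("G",)): (0, 1),
    ("chr2", 50, "G", ("A",)): (1, 1),
}
harmonised = harmonise(store_variants, source_calls)
print(harmonised)  # [(0, 1), None, (1, 1)]
```

The result always has one entry per store variant, which is what allows the later append to pair rows by position.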
The append operation will perform a check that the contig, position and allele (REF and ALT) fields all match before performing the update. It will fail if there is a mismatch, so samples with inconsistent variant sets cannot be appended. The check is strict - allele ordering must match exactly too.
Multiallelic sites and split alleles (multiple records for a site) are both accepted, as long as the ordering is consistent across all source VCFs.
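The strictness of the check can be illustrated with a small, self-contained sketch. This is not the vczstore code; the function name `check_variants_match` and the tuple representation of a variant are hypothetical. It shows that two variant lists with the same alleles in a different ALT order are treated as a mismatch.

```python
def check_variants_match(store, new):
    """Raise ValueError unless both variant lists are identical.

    Each list holds (contig, position, alleles) tuples, where alleles is a
    tuple with REF first followed by the ALT alleles in file order.
    """
    if len(store) != len(new):
        raise ValueError(f"variant count differs: {len(store)} vs {len(new)}")
    for index, (expected, actual) in enumerate(zip(store, new)):
        if expected != actual:
            raise ValueError(
                f"mismatch at variant {index}: {expected} != {actual}"
            )


store = [("chr1", 100, ("A", "G", "T")), ("chr1", 200, ("C", "T"))]
same = [("chr1", 100, ("A", "G", "T")), ("chr1", 200, ("C", "T"))]
reordered = [("chr1", 100, ("A", "T", "G")), ("chr1", 200, ("C", "T"))]

check_variants_match(store, same)  # identical lists: passes silently

try:
    check_variants_match(store, reordered)
except ValueError as error:
    print(error)  # same alleles, different ALT order: rejected
```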
A VCZ store is just a VCZ file - typically in a cloud object store - so it's possible to create one using bio2zarr (e.g. vcf2zarr). However, when the VCFs being appended contain different variants, the store must be created with the full set of variants. This is achieved by merging a small number of VCFs that collectively define the set of variants (e.g. one for each genotype array), using a vcf-merge-stable tool that is similar to bcftools merge, but provides a stability guarantee that is required when working with Zarr.
After creation the store contains no samples.
New samples often arrive as single-sample VCFs. For efficiency, it is best to append in batches that correspond to the sample chunk size (default 10,000), but it is possible to append a smaller number of samples.
The single-sample VCFs are turned into a single VCF using bcftools merge. However, note that in this case the samples must all have the same set of variants. If there are VCFs with different variants (e.g. from different genotype arrays) then they must be appended in separate operations.
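The batching advice above can be sketched as a simple grouping of incoming sample IDs into chunk-sized batches. The helper name `chunk_aligned_batches` is hypothetical, and the sketch assumes the store's current sample count is already a multiple of the chunk size, so that each full batch fills exactly one sample chunk.

```python
def chunk_aligned_batches(sample_ids, chunk_size=10_000):
    """Yield consecutive batches of at most chunk_size sample IDs."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(sample_ids), chunk_size):
        yield sample_ids[start:start + chunk_size]


sample_ids = [f"S{i:05d}" for i in range(25_000)]
batches = list(chunk_aligned_batches(sample_ids))
print([len(batch) for batch in batches])  # [10000, 10000, 5000]
```

Each batch would then correspond to one merged VCF and one append operation; the final, smaller batch is still valid, just less efficient.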
The vczstore normalise step ensures that the VCZ contains the same number of variants as the store (and in the same order), so the new sample data can simply be appended to the end of the genotype arrays.
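To see why matching variant order makes the append simple, consider genotypes laid out as one row per variant and one column per sample. The sketch below uses nested Python lists rather than Zarr arrays, and the function name `append_samples` is an illustrative assumption, not part of vczstore.

```python
def append_samples(store_genotypes, new_genotypes):
    """Return genotype rows extended with the new samples' calls.

    Both arguments are lists with one row per variant; each row holds one
    call per sample. Rows are paired by position, which is only valid
    when both inputs list the same variants in the same order.
    """
    if len(store_genotypes) != len(new_genotypes):
        raise ValueError("variant counts differ; normalise first")
    return [old + new for old, new in zip(store_genotypes, new_genotypes)]


store_genotypes = [[(0, 0), (0, 1)], [(1, 1), (0, 0)]]  # 2 variants x 2 samples
new_genotypes = [[(0, 1)], [(0, 0)]]                    # 2 variants x 1 sample
combined = append_samples(store_genotypes, new_genotypes)
print(combined)
# [[(0, 0), (0, 1), (0, 1)], [(1, 1), (0, 0), (0, 0)]]
```

No row is reordered or looked up by key: the new columns are simply placed after the existing ones.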
When a sample is removed, all data for that sample is overwritten with missing values. This is performed directly on the store.
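As a rough illustration of removal by overwriting, the sketch below masks one sample's column in a small in-memory genotype table. It is not the vczstore implementation: the function name `remove_sample` is hypothetical, and the use of `-1` as the missing-allele sentinel is an assumption. What happens to the sample's entry in the sample ID list is not covered here.

```python
MISSING_CALL = -1  # assumed sentinel for a missing integer allele


def remove_sample(sample_ids, genotypes, sample_id):
    """Mask one sample's calls in place, leaving array shapes unchanged.

    genotypes[v][s] is a list of allele indices for variant v, sample s.
    """
    column = sample_ids.index(sample_id)  # ValueError if the ID is unknown
    for row in genotypes:
        row[column] = [MISSING_CALL] * len(row[column])


sample_ids = ["NA00001", "NA00002", "NA00003"]
genotypes = [
    [[0, 0], [0, 1], [1, 1]],
    [[0, 1], [1, 1], [0, 0]],
]
remove_sample(sample_ids, genotypes, "NA00002")
print(genotypes)
# [[[0, 0], [-1, -1], [1, 1]], [[0, 1], [-1, -1], [0, 0]]]
```

Because the shapes do not change, the other samples keep their column positions and no data needs to be moved.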
The implementation uses zarr-python (version 3) directly to update Zarr chunks in the store. Stores using Zarr format 2 and 3 are supported, as well as Icechunk storage. Using Icechunk means that updates are performed in a transaction so that other users accessing the store will see consistent updates.
| Storage | Zarr format 2 | Zarr format 3 |
|---|---|---|
| zarr-python (no transactions) | ✅ | ✅ |
| Icechunk (transactions) | N/A | ✅ |
VCF Zarr stores can reside in cloud stores such as Amazon S3 or Azure Blob Storage.
For Icechunk-backed stores, local filesystem paths are supported, as are s3://, az://, azure://, abfs://, and abfss:// URLs and Azure https://...blob.core.windows.net/... or https://...dfs.core.windows.net/... URLs.
- Transactions: none
- Distributed: single process
- Zarr: format 2
% uv sync --group dev
# Create some VCZ data
% rm -rf data
% mkdir data
% uv run vcf2zarr convert --no-progress --samples-chunk-size=4 tests/data/vcf/sample-part1.vcf.gz data/store.vcz
% uv run vcf2zarr convert --no-progress tests/data/vcf/sample-part2.vcf.gz data/sample-part2.vcf.vcz
# Show the samples in each
% uv run vcztools query -l data/store.vcz
NA00001
NA00002
% uv run vcztools query -l data/sample-part2.vcf.vcz
NA00003
# Append data to the store
% uv run vczstore append data/store.vcz data/sample-part2.vcf.vcz
% uv run vcztools query -l data/store.vcz
NA00001
NA00002
NA00003
# Remove a sample from the store
% uv run vczstore remove data/store.vcz NA00002
% uv run vcztools query -l data/store.vcz
NA00001
NA00003
- Transactions: icechunk
- Distributed: single process
- Zarr: format 3
% uv sync --extra icechunk
# Create some VCZ data
% rm -rf data
% mkdir data
% uv run vcf2zarr convert --no-progress --zarr-format=3 --samples-chunk-size=4 tests/data/vcf/sample-part1.vcf.gz data/sample-part1.vcf.vcz
% uv run vcf2zarr convert --no-progress --zarr-format=3 tests/data/vcf/sample-part2.vcf.gz data/sample-part2.vcf.vcz
# Copy first vcz to an icechunk store
% uv run vczstore copy-store-to-icechunk data/sample-part1.vcf.vcz data/store.vcz
% rm -rf data/sample-part1.vcf.vcz
# Show the samples in each
% uv run vcztools query -l data/store.vcz --zarr-backend-storage icechunk
NA00001
NA00002
% uv run vcztools query -l data/sample-part2.vcf.vcz
NA00003
# Append data to the store
% uv run vczstore append data/store.vcz data/sample-part2.vcf.vcz --zarr-backend-storage icechunk
% uv run vcztools query -l data/store.vcz --zarr-backend-storage icechunk
NA00001
NA00002
NA00003
# Remove a sample from the store
% uv run vczstore remove data/store.vcz NA00002 --zarr-backend-storage icechunk
% uv run vcztools query -l data/store.vcz --zarr-backend-storage icechunk
NA00001
NA00003