A fast, parallel k-mer counter for DNA sequences in FASTA files.
- Fast parallel processing using rayon and dashmap
- Canonical k-mers - outputs the lexicographically smaller of each k-mer and its reverse complement
- Flexible k-mer lengths from 1 to 32
- Handles N bases by skipping invalid k-mers
- Jellyfish-compatible output format for easy integration with existing pipelines
- Tested for accuracy against Jellyfish
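To illustrate what "canonical" and "skipping invalid k-mers" mean in the list above, here is a minimal string-based sketch. The helper names `complement`, `canonical`, and `canonical_kmers` are hypothetical and are not part of kmerust's API; the crate's own implementation may differ.

```rust
/// Complement of a single DNA base; `None` for anything other than A, C, G, T.
fn complement(base: u8) -> Option<u8> {
    match base.to_ascii_uppercase() {
        b'A' => Some(b'T'),
        b'C' => Some(b'G'),
        b'G' => Some(b'C'),
        b'T' => Some(b'A'),
        _ => None, // N and any other symbol are treated as invalid
    }
}

/// Canonical form of a k-mer: the lexicographically smaller of the k-mer
/// and its reverse complement. Returns `None` if any base is invalid.
/// (Hypothetical helper, for illustration only.)
fn canonical(kmer: &[u8]) -> Option<String> {
    let forward: Vec<u8> = kmer.iter().map(|b| b.to_ascii_uppercase()).collect();
    // Reverse complement: complement each base while reading right to left.
    let revcomp: Vec<u8> = kmer
        .iter()
        .rev()
        .map(|&b| complement(b))
        .collect::<Option<Vec<u8>>>()?;
    let smaller = if forward <= revcomp { forward } else { revcomp };
    String::from_utf8(smaller).ok()
}

/// Canonical k-mers of `seq`, skipping every window that contains an invalid base.
fn canonical_kmers(seq: &str, k: usize) -> Vec<String> {
    if k == 0 {
        return Vec::new(); // `windows(0)` would panic
    }
    seq.as_bytes().windows(k).filter_map(canonical).collect()
}
```

For example, `canonical(b"GGCAT")` and `canonical(b"ATGCC")` both yield `ATGCC`, so the two strands of the same site are counted together.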
Install from crates.io:

cargo install kmerust

Or build from source:

git clone https://github.com/suchapalaver/kmerust.git
cd kmerust
cargo install --path .

Usage:

kmerust <k> <path>

Arguments:
- <k> - K-mer length (1-32)
- <path> - Path to a FASTA file

Options:
- -h, --help - Print help information
- -V, --version - Print version information
Count 21-mers in a FASTA file:
kmerust 21 sequences.fa > kmers.txt

Count 5-mers:

kmerust 5 sequences.fa > kmers.txt

kmerust supports two FASTA readers via feature flags:
- rust-bio (default) - Uses the rust-bio library
- needletail - Uses the needletail library
To use needletail instead:
cargo run --release --no-default-features --features needletail -- 21 sequences.fa

Enable production features for additional capabilities:

cargo build --release --features production

Or enable individual features:
- gzip - Read gzip-compressed FASTA files (.fa.gz)
- mmap - Memory-mapped I/O for large files
- tracing - Structured logging and diagnostics
With the gzip feature, kmerust can directly read gzip-compressed files:
cargo run --release --features gzip -- 21 sequences.fa.gz

With the tracing feature, use the RUST_LOG environment variable for diagnostic output:

RUST_LOG=kmerust=debug cargo run --features tracing -- 21 sequences.fa

Output is written to stdout in FASTA-like format:
>{count}
{canonical_kmer}
Example output:
>114928
ATGCC
>289495
AATCA
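Downstream tools can read this format back with a few lines of code. The sketch below is a hypothetical helper (`parse_counts` is not part of kmerust) that assumes counts fit in a `u64` and that every `>{count}` header is followed by exactly one k-mer line.

```rust
use std::collections::HashMap;

/// Parse `>{count}` / `{kmer}` line pairs into a map from k-mer to count.
/// Hypothetical helper for illustration; blank lines are ignored.
fn parse_counts(text: &str) -> Result<HashMap<String, u64>, String> {
    let mut counts = HashMap::new();
    let mut lines = text.lines().map(str::trim).filter(|l| !l.is_empty());
    while let Some(header) = lines.next() {
        // The header line carries the count after a leading '>'.
        let count = header
            .strip_prefix('>')
            .ok_or_else(|| format!("expected '>' header, got {header:?}"))?
            .parse::<u64>()
            .map_err(|e| format!("bad count in {header:?}: {e}"))?;
        // The line after the header is the k-mer itself.
        let kmer = lines
            .next()
            .ok_or_else(|| format!("missing k-mer after {header:?}"))?;
        counts.insert(kmer.to_string(), count);
    }
    Ok(counts)
}
```

Applied to the example output above, this returns a map with `ATGCC` mapped to 114928 and `AATCA` mapped to 289495.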
kmerust can also be used as a library:
use kmerust::run::count_kmers;
use std::path::PathBuf;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let path = PathBuf::from("sequences.fa");
    // Count 21-mers in the FASTA file.
    let counts = count_kmers(&path, 21)?;
    for (kmer, count) in counts {
        println!("{kmer}: {count}");
    }
    Ok(())
}

Monitor progress during long-running operations:
use kmerust::run::count_kmers_with_progress;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The closure receives progress updates; here they are printed to stderr.
    let counts = count_kmers_with_progress("genome.fa", 21, |progress| {
        eprintln!(
            "Processed {} sequences ({} bases)",
            progress.sequences_processed,
            progress.bases_processed
        );
    })?;
    Ok(())
}

For large files, use memory-mapped I/O (requires the mmap feature):
use kmerust::run::count_kmers_mmap;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = count_kmers_mmap("large_genome.fa", 21)?;
    println!("Found {} unique k-mers", counts.len());
    Ok(())
}

For memory-efficient processing:
use kmerust::streaming::count_kmers_streaming;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let counts = count_kmers_streaming("genome.fa", 21)?;
    println!("Found {} unique k-mers", counts.len());
    Ok(())
}

kmerust uses parallel processing to efficiently count k-mers:
- Sequences are processed in parallel using rayon
- A concurrent hash map (dashmap) lets threads update counts concurrently, using sharded locking to keep contention low
- FxHash provides fast hashing for 64-bit packed k-mers
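The points above can be sketched with the standard library alone. This is an illustration under stated assumptions, not kmerust's implementation: the 2-bit encoding (A=0, C=1, G=2, T=3) is assumed, all function names are hypothetical, and scoped threads with per-thread maps merged at the end stand in for rayon and dashmap. It does show why k is capped at 32: 32 bases at 2 bits each fill a `u64` exactly.

```rust
use std::collections::HashMap;
use std::thread;

/// 2-bit code for a base (assumed encoding); `None` for N or any other symbol.
fn encode_base(b: u8) -> Option<u64> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Count packed canonical k-mers in one sequence with a rolling window.
fn count_packed(seq: &[u8], k: usize) -> HashMap<u64, u64> {
    assert!((1..=32).contains(&k), "k must be in 1..=32 to fit in a u64");
    let mask = if k == 32 { u64::MAX } else { (1u64 << (2 * k)) - 1 };
    let shift = 2 * (k - 1);
    let (mut fwd, mut rev, mut valid) = (0u64, 0u64, 0usize);
    let mut counts = HashMap::new();
    for &b in seq {
        match encode_base(b) {
            Some(code) => {
                // Append the base to the forward k-mer and prepend its
                // complement (3 - code) to the reverse-complement k-mer.
                fwd = ((fwd << 2) | code) & mask;
                rev = (rev >> 2) | ((3 - code) << shift);
                valid += 1;
                if valid >= k {
                    // Numeric order of packed values matches lexicographic order.
                    *counts.entry(fwd.min(rev)).or_insert(0) += 1;
                }
            }
            // An invalid base invalidates every window that overlaps it.
            None => valid = 0,
        }
    }
    counts
}

/// Count across sequences in parallel: one local map per thread, merged at the end.
fn count_parallel(seqs: &[&[u8]], k: usize, threads: usize) -> HashMap<u64, u64> {
    let t = threads.max(1);
    let chunk = ((seqs.len() + t - 1) / t).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = seqs
            .chunks(chunk)
            .map(|part| {
                s.spawn(move || {
                    let mut local: HashMap<u64, u64> = HashMap::new();
                    for seq in part {
                        for (kmer, n) in count_packed(seq, k) {
                            *local.entry(kmer).or_insert(0) += n;
                        }
                    }
                    local
                })
            })
            .collect();
        let mut total: HashMap<u64, u64> = HashMap::new();
        for h in handles {
            for (kmer, n) in h.join().expect("worker thread panicked") {
                *total.entry(kmer).or_insert(0) += n;
            }
        }
        total
    })
}

/// Unpack a 2-bit encoded k-mer back into text.
fn decode(packed: u64, k: usize) -> String {
    (0..k)
        .rev()
        .map(|i| match (packed >> (2 * i)) & 3 {
            0 => 'A',
            1 => 'C',
            2 => 'G',
            _ => 'T',
        })
        .collect()
}
```

Merging per-thread maps avoids shared state entirely, at the cost of a final merge pass; a shared concurrent map trades that pass for synchronization on each update.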
MIT License - see LICENSE for details.