Rust CLI for transcribing audio with a W2V-BERT frontend, an ONNX CTC acoustic model, SentencePiece decoding, and optional KenLM reranking.
The ONNX acoustic model may use either fp16 or fp32 tensors. Input and output precision are detected from the model metadata at load time.
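For reference, a minimal sketch of how tensor precision can be inspected, assuming the ort 2.x Rust API (`Session::builder`, `commit_from_file`, `ValueType::Tensor`); the crate's actual detection logic may differ:

```rust
use ort::session::Session;
use ort::value::ValueType;

fn main() -> ort::Result<()> {
    let session = Session::builder()?.commit_from_file("model_optimized.onnx")?;
    for input in &session.inputs {
        // `ty` is a TensorElementType such as Float16 or Float32.
        if let ValueType::Tensor { ty, .. } = &input.input_type {
            println!("input {}: {ty:?}", input.name);
        }
    }
    Ok(())
}
```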
- Rust 1.94 or newer
- Local model artifacts:
  - `model_optimized.onnx`
  - `tokenizer.model`
  - optional KenLM model such as `lm.binary` or `news-titles.arpa` for reranking
The model files are ignored by git because they are large local artifacts.
```bash
cargo run --release -- <audio-file> [model.onnx] [tokenizer.model] [beam-width] [lm.binary] [lm-weight] [word-bonus]
```

Defaults:

| Argument | Default |
| --- | --- |
| `model.onnx` | `model_optimized.onnx` |
| `tokenizer.model` | `tokenizer.model` |
| `beam-width` | `32` |
| `n-best` | `beam-width` |
| `lm.binary` | `lm.binary` |
| `lm-weight` | `0.45` |
| `word-bonus` | `0.2` |
| `hot-word-bonus` | `0.0` |
If the default LM path does not exist, the CLI disables KenLM and decodes without LM reranking.
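For illustration, a minimal sketch of that fallback using only the standard library; `resolve_lm` is a hypothetical helper, not the CLI's actual code:

```rust
use std::path::{Path, PathBuf};

// Return the LM path only if the file exists; otherwise decode without
// KenLM reranking, mirroring the CLI behavior described above.
fn resolve_lm(default_path: &str) -> Option<PathBuf> {
    let path = Path::new(default_path);
    if path.exists() {
        Some(path.to_path_buf())
    } else {
        eprintln!("{default_path} not found; decoding without LM reranking");
        None
    }
}

fn main() {
    let lm = resolve_lm("lm.binary");
    println!("KenLM enabled: {}", lm.is_some());
}
```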
Example:
```bash
cargo run --release -- example_1.wav
```

By default ONNX Runtime uses CPU execution. Hardware execution providers are opt-in Cargo features:
```bash
cargo run --release --features coreml -- example_1.wav
cargo run --release --features cuda -- example_1.wav
```

CoreML requires macOS. CUDA requires an ONNX Runtime CUDA build and compatible NVIDIA CUDA libraries.
By default, the matching ONNX Runtime binaries are downloaded at build time. To load an external ONNX Runtime dynamic library instead, build with the `ort-dynamic` feature and set `ORT_DYLIB_PATH` or pass `--ort-dylib`:
```bash
ORT_DYLIB_PATH=/path/to/libonnxruntime.dylib cargo run --release --no-default-features --features ort-dynamic -- example_1.wav
cargo run --release --no-default-features --features ort-dynamic -- example_1.wav --ort-dylib /path/to/libonnxruntime.dylib
```

Print help:

```bash
cargo run -- --help
```

The full pipeline is configurable from the CLI:
```bash
cargo run --release -- example_1.wav \
  --ort-optimization level1 \
  --fallback-sample-rate 16000 \
  --strict-audio-decode \
  --w2v-sample-rate 16000 \
  --w2v-feature-size 80 \
  --w2v-stride 2 \
  --blank-id 0 \
  --n-best 16 \
  --no-normalize-spaces \
  --no-accelerator-log \
  --no-lm-log \
  --hot-word "Київ" \
  --hot-word "Іван Франко" \
  --hot-word-bonus 2.0 \
  --lm-no-bos \
  --lm-no-eos
```

The package also includes a CTC-segmentation aligner based on the dynamic programming method from arXiv:2007.09127v2. It aligns an existing transcript to audio and prints tab-separated utterance segments:
```bash
cargo run --release --bin ctc-align -- audio.wav transcript.txt model_optimized.onnx tokenizer.model
cargo run --release --bin ctc-align -- audio.wav transcript.txt --output-format jsonl --output-file segments.jsonl
```

`transcript.txt` should contain one utterance per non-empty line. Output columns for the default TSV format are:

```text
start  end  score  text
```
The score is the minimum windowed mean of frame log-probabilities, as defined in the paper. By default the tool infers the CTC frame duration as audio duration divided by the number of CTC frames; pass `--index-duration` if your model requires a fixed value. The aligner uses the paper's reference moving-window table fill to keep memory bounded by `window_size * transcript_tokens`; tune `--min-window-size` and `--max-window-size` for long recordings.
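For intuition, a minimal sketch of the two quantities above in plain Rust: the inferred frame duration and the windowed-minimum segment score. This is illustrative only, not the `ctc-align` implementation:

```rust
/// Seconds of audio represented by one CTC output frame, inferred as
/// audio duration divided by the number of CTC frames.
fn index_duration(audio_seconds: f64, ctc_frames: usize) -> f64 {
    audio_seconds / ctc_frames as f64
}

/// Segment score: the minimum over sliding windows of the mean
/// frame log-probability within the window.
fn segment_score(frame_log_probs: &[f64], window: usize) -> f64 {
    assert!(window > 0 && window <= frame_log_probs.len());
    frame_log_probs
        .windows(window)
        .map(|w| w.iter().sum::<f64>() / window as f64)
        .fold(f64::INFINITY, f64::min)
}

fn main() {
    // Hypothetical numbers: 8.515s of audio producing 425 CTC frames.
    println!("frame duration: {:.4}s", index_duration(8.515, 425));
    println!("score: {:.3}", segment_score(&[-0.1, -0.2, -1.5, -0.3], 2));
}
```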
For Rust callers, `TranscriptionConfig` is split by processing stage:
```rust
use rust_asr::{
    AcousticModelConfig, CtcDecoderConfig, DecoderConfig, EncoderConfig,
    RuntimeConfig, TextDecoderConfig, TranscriptionConfig, W2vBertEncoderConfig,
    audio::AudioDecodeConfig,
    model::{ModelConfig, ModelOptimizationLevel},
};

let config = TranscriptionConfig {
    runtime: RuntimeConfig { ort_dylib_path: None },
    audio: AudioDecodeConfig {
        fallback_sample_rate: 16_000,
        skip_decode_errors: true,
    },
    encoder: EncoderConfig {
        w2v_bert: W2vBertEncoderConfig {
            sample_rate: Some(16_000),
            feature_size: Some(80),
            stride: Some(2),
            ..Default::default()
        },
    },
    model: AcousticModelConfig {
        path: "model_optimized.onnx".into(),
        session: ModelConfig {
            optimization_level: ModelOptimizationLevel::Disable,
            log_accelerator: true,
        },
    },
    decoder: DecoderConfig {
        ctc: CtcDecoderConfig {
            blank_id: 0,
            beam_width: 32,
            n_best: 32,
        },
        text: TextDecoderConfig {
            tokenizer_path: "tokenizer.model".into(),
            normalize_spaces: true,
            drop_empty_candidates: true,
        },
        language_model: None,
    },
};
```

The crate can also be built as a PyO3 extension module for Python 3.10+. Use maturin to install the extension into the active Python environment:
```bash
uvx maturin develop --release --features python
```

Build with accelerators by combining features:
```bash
uvx maturin develop --release --features "python coreml"
uvx maturin develop --release --features "python cuda"
```

CoreML is for macOS. CUDA requires a compatible NVIDIA CUDA runtime. If you use `uv run`, rebuild the environment after changing Rust/PyO3 signatures:
```bash
uv cache clean rust-asr
uv sync --reinstall-package rust-asr
```

Python API:
```python
from pathlib import Path

import rust_asr

# fp16 and fp32 ONNX acoustic models are both supported. The extension detects
# the model tensor precision when it loads the ONNX session.

# Convenience one-shot call. This initializes the model for this call.
text = rust_asr.transcribe_file(
    "example_1.wav",
    model="model_optimized.onnx",
    tokenizer="tokenizer.model",
    beam_width=32,
    lm=None,
    lm_weight=0.45,
    word_bonus=0.2,
    log_language_model=False,
    ort_dylib_path=None,
    ort_optimization="disable",
    log_accelerator=True,
    fallback_sample_rate=16000,
    skip_decode_errors=True,
    w2v_model_source=None,
    w2v_sample_rate=16000,
    w2v_feature_size=80,
    w2v_stride=2,
    w2v_feature_dim=None,
    w2v_padding_value=None,
    blank_id=0,
    n_best=32,
    normalize_spaces=True,
    drop_empty_candidates=True,
    lm_bos=True,
    lm_eos=True,
)

report = rust_asr.transcribe_file_with_report(
    "example_1.wav",
    model="model_optimized.onnx",
    tokenizer="tokenizer.model",
    lm=None,
)
print(report["transcript"])
print(report["candidates"][0]["total_score"])
print(report["timings"]["model_inference_seconds"])

audio_bytes = Path("example_1.wav").read_bytes()
text_from_bytes = rust_asr.transcribe_bytes(
    audio_bytes,
    format_hint="wav",
    model="model_optimized.onnx",
    tokenizer="tokenizer.model",
    lm=None,
)

# Reusable transcriber. The ONNX model session and tokenizer are initialized
# once and reused for each audio file.
transcriber = rust_asr.Transcriber(
    model="model_optimized.onnx",
    tokenizer="tokenizer.model",
    beam_width=32,
    lm="news-titles.arpa",
    lm_weight=0.45,
    word_bonus=0.2,
    hot_words=["Київ", "Іван Франко"],
    hot_word_bonus=2.0,
    log_language_model=False,
    ort_dylib_path=None,
    ort_optimization="disable",
    log_accelerator=True,
)
first = transcriber.transcribe_file("example_1.wav")
second = transcriber.transcribe_file("example_2.wav")
first_report = transcriber.transcribe_file_with_report("example_1.wav")
first_from_bytes = transcriber.transcribe_bytes(audio_bytes, format_hint="wav")
```

Kotlin/JVM bindings are generated with UniFFI:
```bash
CARGO_PROFILE_RELEASE_STRIP=false cargo build --release --no-default-features --features kotlin,ort-dynamic --lib
cargo install uniffi --version 0.31.1 --locked --features cli --root target/uniffi-tools
target/uniffi-tools/bin/uniffi-bindgen generate target/release/librust_asr.so --language kotlin --out-dir kotlin/generated --no-format
```

The generated Kotlin uses JNA to load `rust_asr`, so make the native library and ONNX Runtime discoverable at runtime.
Kotlin API:
```kotlin
import io.github.rustedbytes.rustasr.KotlinTranscriber
import io.github.rustedbytes.rustasr.defaultOptions

val options = defaultOptions().copy(
    model = "model_optimized.onnx",
    tokenizer = "tokenizer.model",
    lm = "news-titles.arpa",
)

KotlinTranscriber(options).use { transcriber ->
    val text = transcriber.transcribeFile("example_1.wav")
    println(text)
}
```

The crate can be built as a Go cgo package by enabling the go feature. The Go package uses the generated C ABI from cbindgen, so the native library, `c/rust_asr.h`, and the `go/` package must be kept together:
```bash
cargo build --release --no-default-features --features go,ort-dynamic --lib
mkdir -p native
cp target/release/librust_asr.so native/
go test ./go
go run ./examples/transcribe.go example_1.wav
```

Go API:
```go
package main

import (
	"fmt"
	"log"

	rustasr "github.com/RustedBytes/rust-asr/go"
)

func main() {
	transcriber, err := rustasr.NewTranscriber(rustasr.Options{
		Model:     "model_optimized.onnx",
		Tokenizer: "tokenizer.model",
		LM:        "news-titles.arpa",
		BeamWidth: 32,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer transcriber.Close()

	text, err := transcriber.TranscribeFile("example_1.wav")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(text)
}
```

The crate can be built as a Node.js 16+ native extension through Node-API and napi-rs:
```bash
npm install
npm run build:nodejs -- --platform artifacts
```

The generated extension can be loaded from the output directory:
```js
const w2vBertUk = require("./artifacts");

const text = w2vBertUk.transcribeFile("example_1.wav", {
  model: "model_optimized.onnx",
  tokenizer: "tokenizer.model",
  lm: null,
  beamWidth: 32,
  ortOptimization: "disable",
  fallbackSampleRate: 16000,
  skipDecodeErrors: true,
});

const report = w2vBertUk.transcribeFileWithReport("example_1.wav", {
  model: "model_optimized.onnx",
  tokenizer: "tokenizer.model",
  lm: null,
});

const transcriber = new w2vBertUk.Transcriber({
  model: "model_optimized.onnx",
  tokenizer: "tokenizer.model",
  lm: "news-titles.arpa",
});
const reused = transcriber.transcribeFile("example_2.wav");

console.log(text, report.timings.modelInferenceSeconds, reused);
```

The Node.js build uses `ort-dynamic`, so set `ORT_DYLIB_PATH` before loading the extension when ONNX Runtime is not discoverable by the system loader.
Rust:
```bash
cargo run --example transcribe -- example_1.wav
```

Python:
```bash
uvx maturin develop --release --features python
uv run python examples/transcribe.py
```

Node.js:
```bash
npm run build:nodejs -- --platform artifacts
node examples/transcribe.js
```

Kotlin:
```bash
CARGO_PROFILE_RELEASE_STRIP=false cargo build --release --no-default-features --features kotlin,ort-dynamic --lib
target/uniffi-tools/bin/uniffi-bindgen generate target/release/librust_asr.so --language kotlin --out-dir kotlin/generated --no-format
# Compile examples/Transcribe.kt together with kotlin/generated/**/*.kt and
# include JNA on the Kotlin/JVM classpath.
```

Swift:
```bash
cargo build --release --no-default-features --features swift,ort-dynamic --lib
# Compile examples/transcribe.swift together with swift/generated/*.swift and
# swift/generated/rust-asr/*.swift, and link it against the native library.
```

C#:
```bash
cargo build --release --no-default-features --features csharp,ort-dynamic --lib
# Compile examples/Transcribe.cs together with csharp/NativeMethods.g.cs,
# enabling unsafe code and making the native library discoverable at runtime.
```

C and C++:
```bash
cargo build --release --no-default-features --features c,cpp,ort-dynamic --lib
cc -Ic -c examples/transcribe.c -o c-smoke.o
c++ -Icpp -std=c++17 -c examples/transcribe.cpp -o cpp-smoke.o
# Link your application against the generated native library and make ONNX
# Runtime discoverable at runtime when using ort-dynamic.
```

The GitHub Actions workflow in .github/workflows/python-bindings.yml builds Python wheels on:

- `ubuntu-22.04` as `linux-x86_64`
- `macos-latest` as `macos-arm64`
- `windows-latest` as `windows-x86_64`
The Linux wheel is built with `ort-dynamic` because the current bundled ONNX Runtime Linux binary requires newer glibc symbols than common Python runners provide. Use `ORT_DYLIB_PATH` or `ort_dylib_path` to point it at a compatible ONNX Runtime shared library at runtime.
Each job installs the wheel and runs an import smoke test before uploading the wheel artifact. Tag creation also uploads the wheels to the matching GitHub Release.
Build a wheel locally:
```bash
uvx maturin build --release --features python
uvx maturin build --release --features "python coreml"
uvx maturin build --release --features "python cuda"
```

The GitHub Actions workflow in .github/workflows/nodejs-bindings.yml builds Node.js .node extensions on:

- `ubuntu-22.04` as `linux-x64-gnu`
- `macos-latest` as `macos-arm64`
- `windows-latest` as `windows-x64-msvc`
Each job loads the extension in Node.js 16 before uploading the platform artifact. Tag creation also uploads the extensions to the matching GitHub Release.
The GitHub Actions workflow in .github/workflows/kotlin-bindings.yml builds the shared native library and UniFFI-generated Kotlin/JVM bindings on:
- `ubuntu-22.04` as `linux-x64-gnu`
- `macos-latest` as `macos-arm64`
- `windows-latest` as `windows-x64-msvc`
Each job builds with ort-dynamic, generates Kotlin from the native library, and uploads a zip with the native library, generated Kotlin sources, UniFFI config, and example. Tag creation also uploads the zip files to the matching GitHub Release.
The GitHub Actions workflow in .github/workflows/c-cpp-bindings.yml builds the shared native library and generated cbindgen headers on:
- `ubuntu-22.04` as `linux-x64-gnu`
- `macos-latest` as `macos-arm64`
- `windows-latest` as `windows-x64-msvc`
Each job compiles C and C++ header smoke tests before uploading a zip with the native library, generated headers, and examples. Tag creation also uploads the zip files to the matching GitHub Release.
The transcript is printed to stdout. Timings and decoder diagnostics are printed to stderr, so you can redirect the transcript cleanly:
```bash
cargo run --release -- example_1.wav > transcript.txt
```

Timing output includes audio duration, audio decode time, feature extraction time, ONNX session setup, inference, CTC beam search, KenLM reranking, total wall time, and real-time factor:
```text
audio duration: 8.515s
audio decode: 16.578ms
feature extraction: 426.010ms
onnx inference: 2.333s
ctc beam search: 5.336s
kenlm rerank: 21.690ms
RTF/RFT: 1.158x
```
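The real-time factor divides processing wall time by audio duration, so values above 1.0x are slower than real time. A one-line sketch of the arithmetic, assuming that definition:

```rust
/// Real-time factor: total processing wall time over audio duration.
fn real_time_factor(total_wall_seconds: f64, audio_seconds: f64) -> f64 {
    total_wall_seconds / audio_seconds
}

fn main() {
    // With the 8.515s sample above, an RTF of 1.158x corresponds to roughly
    // 9.86s of total wall time (stage timings plus ONNX session setup).
    println!("{:.3}x", real_time_factor(9.86, 8.515));
}
```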
If an LM path is configured, the decoder reranks CTC N-best candidates using shallow fusion:
```text
total = ctc_log_prob + lm_weight * lm_log_prob + word_bonus * word_count + hot_word_score
```
Candidates are scored by KenLM with their decoded casing preserved, so the language model should use casing that matches the tokenizer output.
Pass `--hot-word <word-or-phrase>` one or more times with `--hot-word-bonus <score>` to boost candidates that contain those hot words after whitespace normalization. Hot-word matching is case-insensitive and token-based.
Tune `lm-weight`, `word-bonus`, and `hot-word-bonus` on validation audio instead of relying on the defaults for evaluation.
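A minimal sketch of this scoring rule, with a placeholder in place of the KenLM query and hot-word matching following the documented case-insensitive, token-based rule; it is illustrative, not the crate's implementation:

```rust
struct Candidate {
    text: String,
    ctc_log_prob: f64,
    lm_log_prob: f64, // placeholder: KenLM supplies this in the real pipeline
}

// Case-insensitive, token-based hot-word matching on whitespace-split text.
fn hot_word_score(text: &str, hot_words: &[&str], bonus: f64) -> f64 {
    let tokens: Vec<String> = text.split_whitespace().map(str::to_lowercase).collect();
    let mut score = 0.0;
    for hw in hot_words {
        let hw_tokens: Vec<String> = hw.split_whitespace().map(str::to_lowercase).collect();
        if hw_tokens.is_empty() || hw_tokens.len() > tokens.len() {
            continue;
        }
        if tokens.windows(hw_tokens.len()).any(|w| w == hw_tokens.as_slice()) {
            score += bonus;
        }
    }
    score
}

// total = ctc_log_prob + lm_weight * lm_log_prob + word_bonus * word_count + hot_word_score
fn total_score(
    c: &Candidate,
    lm_weight: f64,
    word_bonus: f64,
    hot_words: &[&str],
    hot_word_bonus: f64,
) -> f64 {
    let word_count = c.text.split_whitespace().count() as f64;
    c.ctc_log_prob
        + lm_weight * c.lm_log_prob
        + word_bonus * word_count
        + hot_word_score(&c.text, hot_words, hot_word_bonus)
}

fn main() {
    let candidate = Candidate {
        text: "Іван Франко народився у Нагуєвичах".into(),
        ctc_log_prob: -12.3,
        lm_log_prob: -20.1,
    };
    let total = total_score(&candidate, 0.45, 0.2, &["Іван Франко"], 2.0);
    println!("total = {total:.3}");
}
```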
```bash
cargo fmt --check
cargo check
```

The CoreML feature uses static input shapes for CoreML subgraphs. Unsupported dynamic graph regions fall back to ONNX Runtime's default execution.