A modular, GPU-accelerated audio processing pipeline that automates the transcription of spoken audio using NVIDIA NeMo's state-of-the-art ASR models.
- GPU-Accelerated ASR using NVIDIA NeMo conformer models
- Speaker Diarization for identifying who spoke when
- Batch Processing of multiple audio files
- High Accuracy with confidence scoring and custom vocabulary support
- Custom Vocabulary for domain-specific terms and corrections
- Advanced Decoding with beam search and contextual biasing
- Structured Output in JSON and human-readable text formats
- Configurable Pipeline with YAML configuration
- Modular Architecture for easy extension and maintenance
- Comprehensive Testing with pytest suite
- Smart Caching to avoid re-processing files
- Multiple Format Support for .wav, .mp3, .m4a, and .flac files
The pipeline follows a diarization-first approach with 5 processing stages:
┌──────────────┐    ┌──────────────┐    ┌───────────────┐    ┌──────────────┐    ┌──────────────┐    ┌─────────────┐
│ load_audio   │ -> │ diarize      │ -> │ transcribe    │ -> │ format       │ -> │ write_output │ -> │ Results     │
│              │    │              │    │               │    │              │    │              │    │             │
│ • Load audio │    │ • Speaker    │    │ • NVIDIA      │    │ • Structure  │    │ • JSON files │    │ • Per-file  │
│ • Validate   │    │   detection  │    │   NeMo ASR    │    │ • Timestamps │    │ • TXT files  │    │   outputs   │
│ • Resample   │    │ • Segments   │    │ • GPU accel   │    │ • Confidence │    │ • Attributed │    │ • Summaries │
│              │    │ • Clustering │    │ • Per-speaker │    │ • Speakers   │    │   TXT files  │    │             │
└──────────────┘    └──────────────┘    └───────────────┘    └──────────────┘    └──────────────┘    └─────────────┘
The pipeline generates multiple output formats for each processed audio file:
Structured data with complete metadata, timestamps, and confidence scores:
{
  "audio_file": {
    "path": "audio.wav",
    "duration": 45.2,
    "sample_rate": 16000
  },
  "transcription": {
    "full_text": "Hello there! How are you today?",
    "segments": [
      {
        "text": "Hello there!",
        "start_time": 0.0,
        "end_time": 1.5,
        "confidence": 0.95,
        "speaker_id": "SPEAKER_00"
      }
    ]
  }
}
Formatted report with statistics and detailed segment breakdown.
New Feature! Dialog format for conversations with speaker labels:
SPEAKER_00: Hello there! How are you doing today?
SPEAKER_01: I'm doing great, thanks for asking!
SPEAKER_00: That's wonderful to hear.
This format maintains natural conversation flow and is perfect for:
- Meeting transcripts
- Interview recordings
- Podcast dialog
- Conference call notes
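If you need this dialog view programmatically, here is a minimal sketch (assuming the JSON schema shown above, where each segment carries a `speaker_id`; the function name is just for illustration) that rebuilds it from a generated `transcript.json`:

```python
import json
from pathlib import Path

def dialog_from_json(transcript_path: Path) -> str:
    """Rebuild a speaker-labelled dialog from a transcript.json file."""
    data = json.loads(transcript_path.read_text(encoding="utf-8"))
    lines = []
    for segment in data["transcription"]["segments"]:
        speaker = segment.get("speaker_id", "SPEAKER_00")  # fall back if diarization was disabled
        lines.append(f"{speaker}: {segment['text']}")
    return "\n".join(lines)

print(dialog_from_json(Path("./outputs/meeting_recording/transcript.json")))
```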
- Python 3.12+
- CUDA 12.8 (for GPU acceleration)
- NVIDIA RTX Titan or compatible GPU (recommended)
- uv package manager
First, install uv if you haven't already:
# On macOS and Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# On Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
# Or using pipx (recommended if you have it)
pipx install uv
# Or as a fallback with pip
pip install uv
# Clone the repository
git clone <your-repo-url>
cd audio_aigented
# Install the package and dependencies
uv pip install -e .
# For development (includes testing tools)
uv pip install -e ".[dev]"
The pipeline requires NVIDIA NeMo for ASR functionality:
uv pip install "nemo-toolkit[asr]"
# Process audio files in ./inputs/ directory (supports .wav, .mp3, .m4a, .flac)
uv run python main.py --input-dir ./inputs
# Use custom output directory
uv run python main.py --input-dir ./my_audio --output-dir ./my_results
# Use CPU instead of GPU
uv run python main.py --input-dir ./inputs --device cpu
# Disable speaker diarization (faster processing)
uv run python main.py --input-dir ./inputs --disable-diarization
# Use faster Parakeet model
uv run python main.py --input-dir ./inputs --model-name nvidia/parakeet-tdt-0.6b-v2
# Clear cache before processing
uv run python main.py --input-dir ./inputs --clear-cache
# Clear cache only (no processing)
uv run python main.py --clear-cache
# Create context template files for your audio
uv run python main.py --input-dir ./inputs --create-context-templates
# Show help for all options
uv run python main.py --help
from pathlib import Path
from src.audio_aigented.pipeline import TranscriptionPipeline
# Initialize pipeline with default settings
pipeline = TranscriptionPipeline()
# Process all files in input directory
results = pipeline.process_directory(Path("./inputs"))
# Process a single file (supports .wav, .mp3, .m4a, .flac)
result = pipeline.process_single_file(Path("./inputs/meeting.mp3"))
print(f"Transcription: {result.full_text}")
Create a custom configuration file:
# config/my_config.yaml
input_dir: "./my_audio_files"
output_dir: "./my_results"
audio:
  sample_rate: 16000
  batch_size: 4
transcription:
  model_name: "stt_en_conformer_ctc_large"
  device: "cuda"
  enable_confidence_scores: true
processing:
  enable_diarization: true
  enable_caching: true
  log_level: "INFO"
output:
  formats: ["json", "txt", "attributed_txt"]
  include_timestamps: true
  pretty_json: true
Use with CLI:
uv run python main.py --config ./config/my_config.yaml
inputs/
├── meeting_recording.wav
├── interview_audio.wav
└── conference_call.wav
outputs/
├── meeting_recording/
│   ├── transcript.json               # Structured data with timestamps & speakers
│   ├── transcript.txt                # Human-readable format with statistics
│   └── transcript_attributed.txt     # Theater-style speaker dialog
├── interview_audio/
│   ├── transcript.json
│   ├── transcript.txt
│   └── transcript_attributed.txt
└── processing_summary.txt            # Overall processing report
{
  "audio_file": {
    "path": "./inputs/meeting.wav",
    "duration": 125.3,
    "sample_rate": 16000
  },
  "transcription": {
    "full_text": "Good morning everyone, let's begin the meeting...",
    "segments": [
      {
        "text": "Good morning everyone",
        "start_time": 0.0,
        "end_time": 2.1,
        "confidence": 0.95
      }
    ]
  },
  "processing": {
    "processing_time": 12.4,
    "model_info": {"name": "stt_en_conformer_ctc_large"}
  }
}
Provide custom vocabulary, speaker names, and corrections for individual audio files to dramatically improve transcription accuracy.
uv run python main.py --input-dir ./inputs --create-context-templates
This creates `.context.json` files for each audio file with the following structure:
{
  "vocabulary": ["technical_term1", "product_name"],
  "corrections": {
    "mistranscribed": "correct_term"
  },
  "speakers": {
    "SPEAKER_00": "John Smith",
    "SPEAKER_01": "Jane Doe"
  },
  "topic": "Meeting about AI implementation",
  "acronyms": {
    "AI": "Artificial Intelligence",
    "ROI": "Return on Investment"
  },
  "phrases": ["machine learning pipeline", "quarterly targets"],
  "notes": "Q4 planning meeting with technical discussion"
}
The pipeline looks for context in order of priority:
1. `audio.wav.context.json` - JSON sidecar file
2. `audio.wav.txt` - Simple vocabulary list (one term per line)
3. `.context/audio.json` - Centralized context directory
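A minimal sketch of this lookup order (a hypothetical helper for illustration, not the package's own function):

```python
from pathlib import Path

def find_context_file(audio_path: Path) -> Path | None:
    """Return the first existing context source, in priority order."""
    candidates = [
        audio_path.with_name(audio_path.name + ".context.json"),     # meeting.wav.context.json
        audio_path.with_name(audio_path.name + ".txt"),              # meeting.wav.txt
        audio_path.parent / ".context" / f"{audio_path.stem}.json",  # .context/meeting.json
    ]
    return next((path for path in candidates if path.exists()), None)

print(find_context_file(Path("./inputs/meeting.wav")))
```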
For terms common across all files:
uv run python main.py --vocabulary-file ./technical_terms.txt
Format:
# Technical terms
neural_network
kubernetes
microservices
# Corrections
kube -> Kubernetes
ml ops -> MLOps
# Acronyms
K8S:Kubernetes
API:Application Programming Interface
# Phrases
"continuous integration pipeline"
"infrastructure as code"
- Improved Accuracy: Domain-specific terms are recognized correctly
- Speaker Attribution: Replace generic IDs with actual names
- Corrections: Fix systematic transcription errors
- Acronym Expansion: Automatically expand technical acronyms
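A minimal sketch of how a vocabulary file in the format shown above could be parsed into these pieces (a hypothetical helper, not the pipeline's own parser):

```python
from pathlib import Path

def parse_vocabulary_file(path: Path) -> dict:
    """Split the vocabulary file into terms, corrections, acronyms, and phrases."""
    result = {"vocabulary": [], "corrections": {}, "acronyms": {}, "phrases": []}
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue                                    # skip blanks and comments
        if "->" in line:
            wrong, right = (part.strip() for part in line.split("->", 1))
            result["corrections"][wrong] = right        # e.g. kube -> Kubernetes
        elif line.startswith('"') and line.endswith('"'):
            result["phrases"].append(line.strip('"'))   # e.g. "infrastructure as code"
        elif ":" in line:
            acronym, expansion = (part.strip() for part in line.split(":", 1))
            result["acronyms"][acronym] = expansion     # e.g. K8S:Kubernetes
        else:
            result["vocabulary"].append(line)           # plain domain term
    return result

print(parse_vocabulary_file(Path("./technical_terms.txt")))
```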
Extract context from meeting agendas, documentation, presentations, or any related text/HTML files:
# Single content file
uv run python main.py --input-dir ./inputs --content-file meeting_agenda.html
# Multiple content files
uv run python main.py --input-dir ./inputs \
--content-file agenda.html \
--content-file technical_spec.md \
--content-file presentation.txt
# Directory of content files
uv run python main.py --input-dir ./inputs --content-dir ./meeting_materials
Place companion content files next to audio files:
- `meeting.wav` → `meeting.wav.content.txt` (or `.html`, `.md`)
- `presentation.wav` → `presentation.wav.content.html`
The pipeline automatically detects and uses these companion files.
- Technical Terms: CamelCase, snake_case, hyphenated terms
- Acronyms: Automatically detected with expansions
- Proper Names: People, products, companies
- Key Phrases: Frequently mentioned multi-word terms
- Identifiers: Version numbers, ticket IDs, codes
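A rough sketch of the kind of pattern matching involved (illustrative regular expressions only, not the pipeline's actual extraction logic):

```python
import re

text = ("Deploy the RAG (Retrieval Augmented Generation) service on Kubernetes v1.29 "
        "using the LangChain loader and the ml_pipeline module.")

# CamelCase, snake_case, and hyphenated technical terms
terms = re.findall(r"\b(?:[A-Z][a-z]+[A-Z]\w*|\w+_\w+|\w+-\w+)\b", text)

# Acronyms followed by a parenthesised expansion, e.g. "RAG (Retrieval Augmented Generation)"
acronyms = dict(re.findall(r"\b([A-Z]{2,})\s*\(([^)]+)\)", text))

# Version-like identifiers, e.g. "v1.29"
identifiers = re.findall(r"\bv\d+(?:\.\d+)+\b", text)

print(terms, acronyms, identifiers)
```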
Example extracted context:
{
  "vocabulary": ["kubernetes", "langchain", "embeddings", "gpt-4-turbo"],
  "acronyms": {
    "LLM": "Large Language Model",
    "RAG": "Retrieval Augmented Generation",
    "ROI": "Return on Investment"
  },
  "phrases": ["vector database", "machine learning pipeline"],
  "topic": "AI Strategy Meeting - Q4 2024"
}
- `sample_rate`: Target sample rate (default: 16000 Hz)
- `batch_size`: Number of files to process in parallel
- `max_duration`: Maximum segment duration for processing
- `model_name`: NVIDIA NeMo model to use
  - `nvidia/parakeet-tdt-0.6b-v2` (fastest, transducer-based model)
  - `stt_en_conformer_ctc_small` (fast, lower accuracy)
  - `stt_en_conformer_ctc_medium` (balanced)
  - `stt_en_conformer_ctc_large` (slow, higher accuracy)
- `device`: Processing device (`cuda` or `cpu`)
- `enable_confidence_scores`: Include confidence scores in output
- `enable_diarization`: Enable/disable speaker identification (default: `true`)
- Command line: `--enable-diarization` / `--disable-diarization`
- When enabled, segments are automatically labeled with speaker IDs (SPEAKER_00, SPEAKER_01, etc.)
- Uses NVIDIA NeMo's clustering diarization for accurate speaker separation
- `enable_caching`: Cache models and intermediate results
- `parallel_workers`: Number of parallel processing workers
- `log_level`: Logging verbosity (DEBUG, INFO, WARNING, ERROR)
- `--clear-cache`: Clear cached transcription results before processing
  - Use alone to clear the cache without processing: `uv run python main.py --clear-cache`
  - Use with an input directory to clear the cache and then process: `uv run python main.py --input-dir ./inputs --clear-cache`
- `formats`: Output formats (`["json", "txt", "attributed_txt"]`)
- `include_timestamps`: Include timing information
- `include_confidence`: Include confidence scores in output
- `pretty_json`: Format JSON with indentation
Run the comprehensive test suite:
# Run all tests
uv run pytest
# Run with coverage
uv run pytest --cov=src --cov-report=term-missing
# Run specific test file
uv run pytest tests/test_models.py
# Run with verbose output
uv run pytest -v
GPU:
- Small model: ~15x real-time speed
- Medium model: ~8x real-time speed
- Large model: ~4x real-time speed
CPU:
- Small model: ~1.5x real-time speed
- Medium model: ~0.8x real-time speed
- Large model: ~0.4x real-time speed
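Real-time factor for a processed file can be checked from the JSON output, using the `audio_file.duration` and `processing.processing_time` fields shown earlier (the path below is just an example):

```python
import json
from pathlib import Path

data = json.loads(Path("./outputs/meeting_recording/transcript.json").read_text(encoding="utf-8"))
speed = data["audio_file"]["duration"] / data["processing"]["processing_time"]
print(f"{speed:.1f}x real-time")  # e.g. 125.3 s of audio processed in 12.4 s -> ~10.1x
```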
audio_aigented/
├── src/audio_aigented/      # Main package
│   ├── audio/               # Audio loading and preprocessing
│   ├── transcription/       # ASR processing with NeMo
│   ├── formatting/          # Output formatting
│   ├── output/              # File writing
│   ├── config/              # Configuration management
│   ├── models/              # Pydantic data models
│   └── pipeline.py          # Main orchestration
├── tests/                   # Test suite
├── config/                  # Configuration files
├── examples/                # Usage examples
└── main.py                  # CLI entry point
- Uses `ruff` for linting and formatting
- Follows PEP8 with type hints
- Google-style docstrings
- Maximum 500 lines per file
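As an illustration of that style (a made-up helper, not code from the package), a new function would look like:

```python
def merge_segments(texts: list[str], separator: str = " ") -> str:
    """Join transcribed segment texts into a single string.

    Args:
        texts: Segment texts in chronological order.
        separator: String inserted between segments.

    Returns:
        The combined transcript text.
    """
    return separator.join(text.strip() for text in texts)
```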
- Create feature branch
- Implement with comprehensive tests
- Update documentation
- Submit pull request
CUDA Out of Memory
# Use smaller batch size
uv run python main.py --input-dir ./inputs --config config/cpu_config.yaml
# Or switch to CPU
uv run python main.py --input-dir ./inputs --device cpu
No Audio Files Found
- Ensure `.wav` files are in the input directory
- Check file permissions
- Use `--dry-run` to see what files would be processed
Model Download Issues
- NeMo models are downloaded automatically on first use
- Ensure internet connection for initial model download
- Models are cached in `~/.cache/torch/NeMo/`
[Your License Here]
- Fork the repository
- Create a feature branch
- Make your changes with tests
- Update documentation
- Submit a pull request
- Create an issue for bugs or feature requests
- Check existing issues before creating new ones
- Provide detailed information including error logs
Built with ❤️ using NVIDIA NeMo and modern Python practices