πŸŽ™οΈ Audio Transcription Pipeline

A modular, GPU-accelerated audio processing pipeline that automates the transcription of spoken audio using NVIDIA NeMo's state-of-the-art ASR models.

✨ Features

  • πŸš€ GPU-Accelerated ASR using NVIDIA NeMo conformer models
  • 🎀 Speaker Diarization for identifying who spoke when
  • πŸ“ Batch Processing of multiple audio files
  • 🎯 High Accuracy with confidence scoring and custom vocabulary support
  • πŸ“ Custom Vocabulary for domain-specific terms and corrections
  • πŸ” Advanced Decoding with beam search and contextual biasing
  • πŸ“Š Structured Output in JSON and human-readable text formats
  • βš™οΈ Configurable Pipeline with YAML configuration
  • πŸ”„ Modular Architecture for easy extension and maintenance
  • πŸ§ͺ Comprehensive Testing with pytest suite
  • πŸ’Ύ Smart Caching to avoid re-processing files
  • 🎡 Multiple Format Support for .wav, .mp3, .m4a, and .flac files

πŸ—οΈ Architecture

The pipeline follows a diarization-first approach with 5 processing stages:

┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ load_audio   │ -> │ diarize      │ -> │ transcribe   │ -> │ format       │ -> │ write_output │ -> │ Results      │
│              │    │              │    │              │    │              │    │              │    │              │
│ • Load audio │    │ • Speaker    │    │ • NVIDIA     │    │ • Structure  │    │ • JSON files │    │ • Per-file   │
│ • Validate   │    │   detection  │    │   NeMo ASR   │    │ • Timestamps │    │ • TXT files  │    │   outputs    │
│ • Resample   │    │ • Segments   │    │ • GPU accel  │    │ • Confidence │    │ • Attributed │    │ • Summaries  │
│              │    │ • Clustering │    │ • Per-speaker│    │ • Speakers   │    │   TXT files  │    │              │
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘
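
The ordering matters: diarization runs before ASR, so transcription can be performed per speaker segment. Below is a minimal, illustrative sketch of that data flow in plain Python; the function bodies are hypothetical stand-ins, not the pipeline's real implementations (those live in src/audio_aigented/).

from pathlib import Path

def load_audio(path: Path) -> dict:
    # Stand-in: load, validate, and resample to the target rate.
    return {"path": path, "sample_rate": 16000}

def diarize(audio: dict) -> list[dict]:
    # Stand-in: cluster speech into speaker-labelled time segments.
    return [{"speaker_id": "SPEAKER_00", "start": 0.0, "end": 1.5}]

def transcribe(audio: dict, segments: list[dict]) -> list[dict]:
    # Stand-in: run ASR over each speaker segment.
    return [dict(seg, text="hello there") for seg in segments]

def run(path: Path) -> dict:
    audio = load_audio(path)
    segments = diarize(audio)                  # diarization first...
    transcribed = transcribe(audio, segments)  # ...then per-speaker ASR
    return {"audio_file": str(audio["path"]), "segments": transcribed}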

📄 Output Formats

The pipeline generates multiple output formats for each processed audio file:

1. JSON Format (transcript.json)

Structured data with complete metadata, timestamps, and confidence scores:

{
  "audio_file": {
    "path": "audio.wav",
    "duration": 45.2,
    "sample_rate": 16000
  },
  "transcription": {
    "full_text": "Hello there! How are you today?",
    "segments": [
      {
        "text": "Hello there!",
        "start_time": 0.0,
        "end_time": 1.5,
        "confidence": 0.95,
        "speaker_id": "SPEAKER_00"
      }
    ]
  }
}

2. Human-Readable Text (transcript.txt)

Formatted report with statistics and detailed segment breakdown.

3. Theater-Style Attribution (transcript_attributed.txt)

New Feature! Dialog format for conversations with speaker labels:

SPEAKER_00: Hello there! How are you doing today?
SPEAKER_01: I'm doing great, thanks for asking!
SPEAKER_00: That's wonderful to hear.

This format maintains natural conversation flow and is perfect for:

  • Meeting transcripts
  • Interview recordings
  • Podcast dialog
  • Conference call notes

📦 Installation

Prerequisites

  • Python 3.12+
  • CUDA 12.8 (for GPU acceleration)
  • NVIDIA RTX Titan or compatible GPU (recommended)
  • uv package manager

Install uv

First, install uv if you haven't already:

# On macOS and Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# On Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

# Or using pipx (recommended if you have it)
pipx install uv

# Or as a fallback with pip
pip install uv

Install Dependencies

# Clone the repository
git clone <your-repo-url>
cd audio_aigented

# Install the package and dependencies
uv pip install -e .

# For development (includes testing tools)
uv pip install -e ".[dev]"

NVIDIA NeMo Installation

The pipeline requires NVIDIA NeMo for ASR functionality:

uv pip install "nemo-toolkit[asr]"

🚀 Quick Start

1. Basic Usage with CLI

# Process audio files in ./inputs/ directory (supports .wav, .mp3, .m4a, .flac)
uv run python main.py --input-dir ./inputs

# Use custom output directory
uv run python main.py --input-dir ./my_audio --output-dir ./my_results

# Use CPU instead of GPU
uv run python main.py --input-dir ./inputs --device cpu

# Disable speaker diarization (faster processing)
uv run python main.py --input-dir ./inputs --disable-diarization

# Use faster Parakeet model
uv run python main.py --input-dir ./inputs --model-name nvidia/parakeet-tdt-0.6b-v2

# Clear cache before processing
uv run python main.py --input-dir ./inputs --clear-cache

# Clear cache only (no processing)
uv run python main.py --clear-cache

# Create context template files for your audio
uv run python main.py --input-dir ./inputs --create-context-templates

# Show help for all options
uv run python main.py --help

2. Using the Python API

from pathlib import Path
from src.audio_aigented.pipeline import TranscriptionPipeline

# Initialize pipeline with default settings
pipeline = TranscriptionPipeline()

# Process all files in input directory
results = pipeline.process_directory(Path("./inputs"))

# Process a single file (supports .wav, .mp3, .m4a, .flac)
result = pipeline.process_single_file(Path("./inputs/meeting.mp3"))
print(f"Transcription: {result.full_text}")

3. Custom Configuration

Create a custom configuration file:

# config/my_config.yaml
input_dir: "./my_audio_files"
output_dir: "./my_results"

audio:
  sample_rate: 16000
  batch_size: 4

transcription:
  model_name: "stt_en_conformer_ctc_large"
  device: "cuda"
  enable_confidence_scores: true

processing:
  enable_diarization: true
  enable_caching: true
  log_level: "INFO"

output:
  formats: ["json", "txt", "attributed_txt"]
  include_timestamps: true
  pretty_json: true

Use with CLI:

uv run python main.py --config ./config/my_config.yaml

πŸ“ Input and Output

Input Structure

inputs/
├── meeting_recording.wav
├── interview_audio.wav
└── conference_call.wav

Output Structure

outputs/
├── meeting_recording/
│   ├── transcript.json              # Structured data with timestamps & speakers
│   ├── transcript.txt               # Human-readable format with statistics
│   └── transcript_attributed.txt    # Theater-style speaker dialog
├── interview_audio/
│   ├── transcript.json
│   ├── transcript.txt
│   └── transcript_attributed.txt
└── processing_summary.txt           # Overall processing report

Sample JSON Output

{
  "audio_file": {
    "path": "./inputs/meeting.wav",
    "duration": 125.3,
    "sample_rate": 16000
  },
  "transcription": {
    "full_text": "Good morning everyone, let's begin the meeting...",
    "segments": [
      {
        "text": "Good morning everyone",
        "start_time": 0.0,
        "end_time": 2.1,
        "confidence": 0.95
      }
    ]
  },
  "processing": {
    "processing_time": 12.4,
    "model_info": {"name": "stt_en_conformer_ctc_large"}
  }
}
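
Because the JSON schema is stable, downstream tooling can stay simple. As a sketch, the theater-style dialog can be rebuilt from transcript.json using only the fields documented above (speaker_id is present when diarization is enabled):

import json
from pathlib import Path

data = json.loads(Path("outputs/meeting_recording/transcript.json").read_text())

for seg in data["transcription"]["segments"]:
    speaker = seg.get("speaker_id", "UNKNOWN")
    print(f"{speaker}: {seg['text']}")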

🎯 Context Enhancement Features

Per-File Context

Provide custom vocabulary, speaker names, and corrections for individual audio files to dramatically improve transcription accuracy.

1. Create Context Templates

uv run python main.py --input-dir ./inputs --create-context-templates

This creates a .context.json sidecar file for each audio file, with the following structure:

{
  "vocabulary": ["technical_term1", "product_name"],
  "corrections": {
    "mistranscribed": "correct_term"
  },
  "speakers": {
    "SPEAKER_00": "John Smith",
    "SPEAKER_01": "Jane Doe"
  },
  "topic": "Meeting about AI implementation",
  "acronyms": {
    "AI": "Artificial Intelligence",
    "ROI": "Return on Investment"
  },
  "phrases": ["machine learning pipeline", "quarterly targets"],
  "notes": "Q4 planning meeting with technical discussion"
}
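
Sidecar files can also be generated programmatically, for example from a meeting calendar or glossary. A minimal sketch using only the schema above (the audio.wav.context.json naming follows the sidecar convention described in the next section):

import json
from pathlib import Path

audio_path = Path("inputs/meeting.wav")
context = {
    "vocabulary": ["kubernetes", "langchain"],
    "corrections": {"kube": "Kubernetes"},
    "speakers": {"SPEAKER_00": "John Smith", "SPEAKER_01": "Jane Doe"},
    "topic": "Q4 planning meeting",
}

# Write the sidecar next to the audio file: inputs/meeting.wav.context.json
sidecar = audio_path.with_name(audio_path.name + ".context.json")
sidecar.write_text(json.dumps(context, indent=2))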

2. Context File Locations

The pipeline looks for context files in the following order of priority:

  1. audio.wav.context.json - JSON sidecar file
  2. audio.wav.txt - Simple vocabulary list (one term per line)
  3. .context/audio.json - Centralized context directory
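
A sketch of that lookup order in plain Python (illustrative only; the pipeline's own resolver may differ in detail):

from pathlib import Path

def find_context(audio_path: Path) -> Path | None:
    candidates = [
        audio_path.with_name(audio_path.name + ".context.json"),     # 1. JSON sidecar
        audio_path.with_name(audio_path.name + ".txt"),              # 2. vocabulary list
        audio_path.parent / ".context" / f"{audio_path.stem}.json",  # 3. central directory
    ]
    return next((c for c in candidates if c.exists()), None)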

3. Global Vocabulary File

For terms common across all files:

uv run python main.py --vocabulary-file ./technical_terms.txt

Format:

# Technical terms
neural_network
kubernetes
microservices

# Corrections  
kube -> Kubernetes
ml ops -> MLOps

# Acronyms
K8S:Kubernetes
API:Application Programming Interface

# Phrases
"continuous integration pipeline"
"infrastructure as code"

Benefits of Context

  • Improved Accuracy: Domain-specific terms are recognized correctly
  • Speaker Attribution: Replace generic IDs with actual names
  • Corrections: Fix systematic transcription errors
  • Acronym Expansion: Automatically expand technical acronyms

Using Raw Content Files

Extract context from meeting agendas, documentation, presentations, or any related text/HTML files:

Global Content Files

# Single content file
uv run python main.py --input-dir ./inputs --content-file meeting_agenda.html

# Multiple content files
uv run python main.py --input-dir ./inputs \
  --content-file agenda.html \
  --content-file technical_spec.md \
  --content-file presentation.txt

# Directory of content files
uv run python main.py --input-dir ./inputs --content-dir ./meeting_materials

Per-Audio Content Files

Place companion content files next to audio files:

  • meeting.wav β†’ meeting.wav.content.txt (or .html, .md)
  • presentation.wav β†’ presentation.wav.content.html

The pipeline automatically detects and uses these companion files.
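
A sketch of that companion-file convention (illustrative; the pipeline's own detection logic may differ):

from pathlib import Path

def find_content_file(audio_path: Path) -> Path | None:
    # meeting.wav pairs with meeting.wav.content.txt / .content.html / .content.md
    for suffix in (".content.txt", ".content.html", ".content.md"):
        candidate = audio_path.with_name(audio_path.name + suffix)
        if candidate.exists():
            return candidate
    return None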

What Gets Extracted

  • Technical Terms: CamelCase, snake_case, hyphenated terms
  • Acronyms: Automatically detected with expansions
  • Proper Names: People, products, companies
  • Key Phrases: Frequently mentioned multi-word terms
  • Identifiers: Version numbers, ticket IDs, codes

Example extracted context:

{
  "vocabulary": ["kubernetes", "langchain", "embeddings", "gpt-4-turbo"],
  "acronyms": {
    "LLM": "Large Language Model",
    "RAG": "Retrieval Augmented Generation",
    "ROI": "Return on Investment"
  },
  "phrases": ["vector database", "machine learning pipeline"],
  "topic": "AI Strategy Meeting - Q4 2024"
}
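
As a rough illustration of the pattern types listed above (not the pipeline's actual extraction rules), a few regular expressions that pick up such terms:

import re

TEXT = "Deploy the AuthService via snake_case_config on gpt-4-turbo (see JIRA-1234, LLM docs)."

camel_case = re.findall(r"\b[A-Z][a-z]+(?:[A-Z][a-z]+)+\b", TEXT)  # ['AuthService']
snake_case = re.findall(r"\b[a-z]+(?:_[a-z]+)+\b", TEXT)           # ['snake_case_config']
hyphenated = re.findall(r"\b\w+(?:-\w+)+\b", TEXT)                 # ['gpt-4-turbo', 'JIRA-1234']
acronyms   = re.findall(r"\b[A-Z]{2,}\b", TEXT)                    # ['JIRA', 'LLM']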

βš™οΈ Configuration Options

Audio Processing

  • sample_rate: Target sample rate (default: 16000 Hz)
  • batch_size: Number of files to process in parallel
  • max_duration: Maximum segment duration for processing

ASR Settings

  • model_name: NVIDIA NeMo model to use
    • nvidia/parakeet-tdt-0.6b-v2 (fastest, transducer-based model)
    • stt_en_conformer_ctc_small (fast, lower accuracy)
    • stt_en_conformer_ctc_medium (balanced)
    • stt_en_conformer_ctc_large (slow, higher accuracy)
  • device: Processing device (cuda or cpu)
  • enable_confidence_scores: Include confidence scores in output

Speaker Diarization

  • enable_diarization: Enable/disable speaker identification (default: true)
  • Command line: --enable-diarization / --disable-diarization
  • When enabled, segments are automatically labeled with speaker IDs (SPEAKER_00, SPEAKER_01, etc.)
  • Uses NVIDIA NeMo's clustering diarization for accurate speaker separation

Processing Options

  • enable_caching: Cache models and intermediate results
  • parallel_workers: Number of parallel processing workers
  • log_level: Logging verbosity (DEBUG, INFO, WARNING, ERROR)
  • --clear-cache: Clear cached transcription results before processing
    • Use alone to clear cache without processing: uv run python main.py --clear-cache
    • Use with input directory to clear cache and process: uv run python main.py --input-dir ./inputs --clear-cache

Output Options

  • formats: Output formats (["json", "txt", "attributed_txt"])
  • include_timestamps: Include timing information
  • include_confidence: Include confidence scores in output
  • pretty_json: Format JSON with indentation

🧪 Testing

Run the comprehensive test suite:

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=src --cov-report=term-missing

# Run specific test file
uv run pytest tests/test_models.py

# Run with verbose output
uv run pytest -v

πŸƒβ€β™‚οΈ Performance

GPU Performance (RTX Titan 24GB)

  • Small model: ~15x real-time speed
  • Medium model: ~8x real-time speed
  • Large model: ~4x real-time speed

CPU Performance (16-core)

  • Small model: ~1.5x real-time speed
  • Medium model: ~0.8x real-time speed
  • Large model: ~0.4x real-time speed

πŸ› οΈ Development

Project Structure

audio_aigented/
├── src/audio_aigented/           # Main package
│   ├── audio/                    # Audio loading and preprocessing
│   ├── transcription/            # ASR processing with NeMo
│   ├── formatting/               # Output formatting
│   ├── output/                   # File writing
│   ├── config/                   # Configuration management
│   ├── models/                   # Pydantic data models
│   └── pipeline.py               # Main orchestration
├── tests/                        # Test suite
├── config/                       # Configuration files
├── examples/                     # Usage examples
└── main.py                       # CLI entry point

Code Style

  • Uses ruff for linting and formatting
  • Follows PEP8 with type hints
  • Google-style docstrings
  • Maximum 500 lines per file

Adding New Features

  1. Create feature branch
  2. Implement with comprehensive tests
  3. Update documentation
  4. Submit pull request

🔧 Troubleshooting

Common Issues

CUDA Out of Memory

# Use a config with a smaller batch size
uv run python main.py --input-dir ./inputs --config config/cpu_config.yaml

# Or switch to CPU
uv run python main.py --input-dir ./inputs --device cpu

No Audio Files Found

  • Ensure .wav files are in the input directory
  • Check file permissions
  • Use --dry-run to see what files would be processed

Model Download Issues

  • NeMo models are downloaded automatically on first use
  • Ensure internet connection for initial model download
  • Models are cached in ~/.cache/torch/NeMo/

📄 License

[Your License Here]

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes with tests
  4. Update documentation
  5. Submit a pull request

📞 Support

  • Create an issue for bugs or feature requests
  • Check existing issues before creating new ones
  • Provide detailed information including error logs

Built with ❤️ using NVIDIA NeMo and modern Python practices
