- To provide a simple, easy-to-use solution for automatic speech transcription with speaker diarization using faster-whisper and pyannote.audio.
- To provide a containerized solution for batch processing of audio files.
- To transcribe audio files using OpenAI's Whisper model (with fallback to faster-whisper).
- To identify the different speakers in the audio (diarization) using pyannote.audio.
- To output the results in various formats (VTT, SRT, or TXT).
- Most importantly, to test the performance of different models and configurations for batch processing of audio files, and to review the trade-offs between the Python and C++ implementations.
The cpp_version directory contains a C++ implementation with its own README.md and a whisper_benchmark.csv file recording transcription and diarization results.
This directory contains the Python implementation, documented in this README.md.
Add mp3 files to the data directory and run the container:
```bash
# Build the Python version
docker compose up --build

# Put your audio file in the data/ directory and run the container
docker compose run --rm whisper-diarize /data/audio.mp3 --model tiny --num-speakers 2

# Or run the C++ version
cd cpp_version
bash ./examples/run_example.sh --file /data/audio.mp3 --model tiny
```

A containerized solution for automatic speech transcription with speaker diarization using faster-whisper and pyannote.audio.
This is an efficient way to run Whisper on CPU-based systems, with a focus on batch processing. It uses optimization techniques such as quantization and alternative inference frameworks for improved performance, combined with speaker diarization.
This Docker container provides an easy-to-use solution for:
- Transcribing audio files using OpenAI's Whisper model (with fallback to faster-whisper)
- Identifying different speakers in the audio (diarization) using pyannote.audio
- Outputting the results in various formats (VTT, SRT, or TXT)
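As an illustration of the quantized CPU transcription described above, here is a minimal sketch using faster-whisper; the model size, compute type, and file path are assumptions for the example, not the container's exact code:

```python
from faster_whisper import WhisperModel

# Load a quantized model for CPU inference (int8 keeps memory and
# compute costs low on CPU-only systems).
model = WhisperModel("tiny", device="cpu", compute_type="int8")

# Transcribe and print timestamped segments.
segments, info = model.transcribe("/data/audio.mp3")
print(f"Detected language: {info.language}")
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```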
- Docker and Docker Compose installed on your system
- A Hugging Face account and API token (for accessing the diarization models)
- Audio files you want to transcribe (supported formats: mp3, wav, m4a, etc.)
- Clone this repository:

  ```bash
  git clone https://github.com/DonRichards/whisper_testing
  cd whisper_testing
  ```
- Get a Hugging Face API token:
  - Create an account on Hugging Face
  - Go to your profile settings and create an access token
  - The token must have at least READ permission
- Accept the model license agreements:
  - Visit both model pages on Hugging Face: pyannote/speaker-diarization and pyannote/segmentation
  - You may need to first log in to your Hugging Face account
  - Click the "Access repository" button or "Files and versions"
  - Read and accept the terms of use for both models
- Configure your environment:
  - Copy the example .env file:

    ```bash
    cp .env.example .env
    ```
  - Edit the .env file and add your Hugging Face token (see the sketch after this list for how it is used):

    ```
    HF_TOKEN=your_token_here
    ```
- Place your audio files in the data directory:

  ```bash
  mkdir -p data
  cp path/to/your/audio.mp3 data/
  ```
- Run the transcription:

  ```bash
  # Basic usage (output will be audio.vtt)
  docker compose run --rm whisper-diarize /data/audio.mp3 --model tiny

  # Specify number of speakers
  docker compose run --rm whisper-diarize /data/audio.mp3 --model tiny --num-speakers 2

  # Choose output format
  docker compose run --rm whisper-diarize /data/audio.mp3 --model tiny --format txt
  ```
The output file will automatically use the same name as the input file but with the appropriate extension (e.g., audio.mp3 → audio.vtt)
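To show how the HF_TOKEN from the configuration step comes into play, here is a minimal sketch of loading the pyannote diarization pipeline; the environment-variable handling and speaker-count argument are assumptions about the script, not its exact code:

```python
import os
from pyannote.audio import Pipeline

# Read the token configured in .env (assumes it is exported into the
# container's environment).
hf_token = os.environ["HF_TOKEN"]

# Load the gated diarization model; this fails unless the license
# agreements for both pyannote models have been accepted.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization", use_auth_token=hf_token
)

# Run diarization; num_speakers is optional and mirrors --num-speakers.
diarization = pipeline("/data/audio.mp3", num_speakers=2)
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```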
Available Whisper model sizes:
- tiny (fastest, least accurate)
- base
- small
- medium (default)
- large
- large-v2 (slowest, most accurate)
To avoid downloading models every time you run the container, you can pre-download them:
```bash
# Create models directory
mkdir -p models

# Download a specific model size
# Options: "tiny", "base", "small", "medium", "large", "large-v2"
python download_whisper_models.py --model tiny
```
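For reference, a download script like this typically just instantiates the model with a local cache directory and lets faster-whisper fetch the weights; the following is a sketch of that general approach, not necessarily what download_whisper_models.py actually does:

```python
import argparse
from faster_whisper import WhisperModel

parser = argparse.ArgumentParser()
parser.add_argument("--model", default="tiny")
args = parser.parse_args()

# Instantiating the model with download_root triggers the download
# (if needed) and caches the weights under ./models for later runs.
WhisperModel(args.model, device="cpu", compute_type="int8",
             download_root="models")
print(f"Model '{args.model}' cached in ./models")
```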
Command line arguments:
- --model - Whisper model size (default: medium)
- --output - Custom output file path (optional)
- --format - Output format: vtt, srt, or txt (default: vtt)
- --num-speakers - Number of speakers expected in the audio
- --language - Language code for transcription
- --task - Choose between "transcribe" or "translate" (to English)
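A minimal sketch of how such an interface might be defined with argparse; the option names come from the list above, while the defaults, choices, and help text are assumptions:

```python
import argparse

parser = argparse.ArgumentParser(description="Transcribe and diarize audio")
parser.add_argument("audio", help="Input file path, e.g. /data/audio.mp3")
parser.add_argument("--model", default="medium",
                    choices=["tiny", "base", "small", "medium",
                             "large", "large-v2"])
parser.add_argument("--output", help="Custom output file path (optional)")
parser.add_argument("--format", default="vtt", choices=["vtt", "srt", "txt"])
parser.add_argument("--num-speakers", type=int,
                    help="Number of speakers expected in the audio")
parser.add_argument("--language", help="Language code for transcription")
parser.add_argument("--task", default="transcribe",
                    choices=["transcribe", "translate"])
args = parser.parse_args()
```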
- File Not Found Error
  - Make sure your audio file is in the ./data directory
  - Use the correct path (/data/filename.mp3 inside the container)
- Speaker Diarization Not Working
  - Verify your Hugging Face token is correct (see the token check sketch after this list)
  - Ensure you've accepted the terms for both required models:
    - pyannote/speaker-diarization
    - pyannote/segmentation
- Model Download Issues
  - Check your internet connection
  - Verify the models directory has correct permissions
  - Try pre-downloading the model using download_whisper_models.py
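If diarization fails, a quick way to confirm the token itself is valid is to ask the Hugging Face Hub who it belongs to; a minimal sketch, assuming the huggingface_hub package is available in the container:

```python
import os
from huggingface_hub import HfApi

# whoami() raises an error if the token is missing or invalid.
info = HfApi().whoami(token=os.environ["HF_TOKEN"])
print(f"Token is valid for user: {info['name']}")
```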
To avoid accumulating orphan containers:

```bash
# Use --rm flag when running
docker compose run --rm whisper-diarize ...

# Or clean up manually
docker compose down --remove-orphans
docker container prune
```
- VTT (default)
  - WebVTT format with speaker labels
  - Compatible with most video players
- TXT
  - Simple text format
  - Each line prefixed with speaker label
- SRT
  - SubRip format (currently outputs as VTT)
  - Planned for future implementation
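For reference, the VTT output with speaker labels might look roughly like this; the speaker label names and cue layout are assumptions, since the exact format depends on the script:

```
WEBVTT

00:00:00.000 --> 00:00:04.200
SPEAKER_00: Thanks everyone for joining the call today.

00:00:04.500 --> 00:00:07.900
SPEAKER_01: Happy to be here, let's get started.
```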
The system automatically logs performance metrics to /data/whisper_benchmarks.csv, including:
- Timestamp of the run
- Audio filename and duration
- Model used and number of speakers
- Processing time and real-time factor
- CPU and memory usage
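A hypothetical row of that CSV, to show the shape of the data; the column names are assumptions based on the fields listed above, not the file's actual header:

```
timestamp,filename,duration_s,model,num_speakers,processing_time_s,real_time_factor,cpu_percent,memory_mb
2024-01-15T10:32:00,audio.mp3,300,tiny,2,412,1.37,87.5,1650
```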
Expected performance for a 5-minute audio file on a typical CPU:
- tiny: ~1-2x real-time (5-10 minutes)
- base: ~2-3x real-time (10-15 minutes)
- small: ~3-4x real-time (15-20 minutes)
- medium: ~4-6x real-time (20-30 minutes)
- large: ~6-8x real-time (30-40 minutes)
Note: Performance can vary significantly based on:
- CPU speed and number of cores
- Audio quality and complexity
- Number of speakers
- Whether diarization is enabled
The system will provide an estimate of processing time based on:
- Audio file duration
- Selected model size
- Whether speaker diarization is enabled
Approximate processing speeds (on a typical CPU):
- Transcription Only:
  - tiny: ~1.2x real-time
  - base: ~2.0x real-time
  - small: ~3.0x real-time
  - medium: ~4.5x real-time
  - large: ~6.0x real-time
  - large-v2: ~7.0x real-time
- With Speaker Diarization:
  - Add approximately 1.5x real-time
Example:
- 10-minute audio file
- Using 'medium' model
- With speaker diarization
- Expected time: (10 min × 4.5) + (10 min × 1.5) = 60 minutes
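The same estimate can be expressed as a small helper; the speed factors are the approximate values from the list above, and the function itself is an illustration rather than the project's actual code:

```python
# Approximate processing-speed factors from the list above.
SPEED_FACTORS = {"tiny": 1.2, "base": 2.0, "small": 3.0,
                 "medium": 4.5, "large": 6.0, "large-v2": 7.0}
DIARIZATION_FACTOR = 1.5

def estimate_minutes(duration_min: float, model: str, diarize: bool) -> float:
    """Estimate total processing time in minutes."""
    total = duration_min * SPEED_FACTORS[model]
    if diarize:
        total += duration_min * DIARIZATION_FACTOR
    return total

# 10-minute file, medium model, with diarization -> 60.0 minutes
print(estimate_minutes(10, "medium", diarize=True))
```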
Note: Actual processing times may vary based on:
- CPU speed and number of cores
- Available memory
- Audio quality and complexity
- Number of speakers
- Background noise levels
This repository contains testing and benchmarking tools for the Whisper speech recognition model, using the C++ implementation from whisper.cpp.
TO DO