
Speech Emotion Recognition with Wav2Vec2

A cross-platform solution for speech emotion recognition built on Wav2Vec2, with a Python training pipeline and a C++ inference engine.

Overview

This project provides a complete pipeline for speech emotion recognition:

  1. Python components for training and fine-tuning Wav2Vec2 models on emotion recognition tasks
  2. ONNX export for cross-platform model compatibility
  3. C++ inference engine for real-time emotion recognition applications

The system classifies spoken emotions into categories including angry, sad, neutral, happy, excited, frustrated, fear, surprise, and disgust (see Emotion Classes below for the full list).

Features

  • Fine-tune Facebook's Wav2Vec2 models for emotion recognition
  • Data augmentation techniques for improved generalization
  • Focal Loss implementation for handling class imbalance
  • Export models to ONNX format for deployment
  • Fast C++ inference using ONNX Runtime
  • Audio preprocessing and feature extraction
  • Support for both CPU and GPU acceleration

Project Structure

├── python/                  # Python training and export components
│   ├── configs/             # Configuration files
│   └── src/                 # Source code
│       ├── train.py         # Training script
│       ├── inference.py     # Python inference
│       ├── export.py        # Model export to ONNX
│       ├── dataset.py       # Dataset handling
│       ├── augments.py      # Audio augmentation
│       └── loss_function.py # Focal loss implementation
└── cpp/                     # C++ inference engine
    ├── include/             # Header files
    ├── src/                 # C++ source files
    └── CMakeLists.txt       # Build configuration

Requirements

Python

  • Python 3.8+
  • PyTorch
  • Transformers
  • Librosa
  • NumPy
  • scikit-learn
  • datasets
  • torchaudio

C++

  • CMake 3.10+
  • ONNX Runtime
  • libsndfile
  • C++17 compatible compiler
  • jsoncpp

Installation

Python Environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

pip install torch torchaudio transformers datasets librosa scikit-learn numpy

C++ Build

cd cpp
mkdir -p build && cd build
cmake ..
make -j$(nproc)

NOTE: Windows builds are not supported yet.

Usage

Training a Model

  1. Prepare your configuration file (see python/configs/train.json for an example)
  2. Run the training script:
python python/src/train.py --config python/configs/train.json

Optional parameters:

  • --model_name: Specify a different base model (overrides config)
  • --output_dir: Change the output directory (overrides config)

Exporting to ONNX

After training, export your model to ONNX format:

python python/src/export.py --model_path output_model/best_f1 --output_path onnx_model

Parameters:

  • --model_path: Path to the trained PyTorch model
  • --output_path: Output directory for ONNX model
  • --sample_rate: Audio sample rate (default: 16000)
  • --duration: Maximum audio duration in seconds (default: 5)
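
Under the hood, an export step like this traces the model with a fixed-length dummy input. Below is a minimal sketch of the assumed form (not the repository's exact code; the paths and defaults mirror the command above):

import torch
from transformers import Wav2Vec2ForSequenceClassification

# Assumed values mirroring the CLI defaults above.
model_path = "output_model/best_f1"
sample_rate = 16000
duration = 5

model = Wav2Vec2ForSequenceClassification.from_pretrained(model_path)
model.eval()

# Dummy waveform of the maximum supported length: (batch, samples).
dummy_input = torch.randn(1, sample_rate * duration)

torch.onnx.export(
    model,
    (dummy_input,),
    "onnx_model/model.onnx",
    input_names=["input_values"],
    output_names=["logits"],
    dynamic_axes={"input_values": {0: "batch"}, "logits": {0: "batch"}},
    opset_version=14,
)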

Inference with Python

python python/src/inference.py --model_name output_model/best_f1 --audio_file_path path/to/your/audio.wav
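
In essence, the script loads the audio, resamples it to the 16 kHz rate Wav2Vec2 expects, and takes the argmax over the model's logits. A hedged sketch (inference.py may differ in detail):

import librosa
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

model_name = "output_model/best_f1"
audio_file_path = "path/to/your/audio.wav"

extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
model.eval()

# Load and resample to 16 kHz mono.
waveform, _ = librosa.load(audio_file_path, sr=16000)
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
predicted_id = int(logits.argmax(dim=-1))
print(model.config.id2label[predicted_id])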

Inference with C++

./build/wav2vec-emotions path/to/your/audio.wav onnx_model/metadata.json

Training Configuration

The training configuration file (train.json) contains the following parameters:

{
    "model_name": "facebook/wav2vec2-base",
    "seed": 42,
    "val_size": 0.15,
    "max_lenght_seconds": 5,
    "is_simplified": false,
    "gamma": 2.0,
    "alpha": 0.1, 
    "training_args": {
        "output_dir": "output_model",
        "save_strategy": "epoch",
        "learning_rate": 3e-5,
        "per_device_train_batch_size": 16,
        "per_device_eval_batch_size": 16,
        "gradient_accumulation_steps": 4,
        "num_train_epochs": 10,
        "lr_scheduler_type": "cosine", 
        "weight_decay": 0.04, 
        "warmup_ratio": 0.1,
        "logging_steps": 50,
        "eval_strategy": "epoch",
        "save_total_limit": 2,
        "fp16": true,
        "dataloader_num_workers": 8,
        "report_to": "none",
        "load_best_model_at_end": true,
        "metric_for_best_model": "f1"
    }
}
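
The gamma and alpha keys parameterize the focal loss used to counter class imbalance. A minimal sketch of the assumed form (see python/src/loss_function.py for the repository's actual implementation):

import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0, alpha: float = 0.1) -> torch.Tensor:
    # Per-sample cross-entropy, kept unreduced.
    ce = F.cross_entropy(logits, targets, reduction="none")
    # p_t is the model's probability for the true class.
    p_t = torch.exp(-ce)
    # Down-weight easy examples by (1 - p_t)^gamma and scale by alpha.
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()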

Emotion Classes

The system classifies emotions into the following categories:

  • angry (0)
  • sad (1)
  • neutral (2)
  • happy (3)
  • excited (4)
  • frustrated (5)
  • fear (6)
  • surprise (7)
  • disgust (8)
  • unknown (9)
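
For reference, the same mapping as a Python dictionary (indices taken from the list above):

ID2LABEL = {
    0: "angry", 1: "sad", 2: "neutral", 3: "happy", 4: "excited",
    5: "frustrated", 6: "fear", 7: "surprise", 8: "disgust", 9: "unknown",
}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}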

Dataset

The system is designed to work with the IEMOCAP dataset, which can be loaded using the Hugging Face datasets library:

dataset = load_dataset("AbstractTTS/IEMOCAP", trust_remote_code=True)
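
A hedged sketch of how the val_size and seed config keys would translate into a train/validation split, assuming the dataset exposes a single train split (dataset.py may differ):

from datasets import load_dataset

dataset = load_dataset("AbstractTTS/IEMOCAP", trust_remote_code=True)
# Hold out 15% for validation, matching "val_size": 0.15 and "seed": 42.
split = dataset["train"].train_test_split(test_size=0.15, seed=42)
train_ds, val_ds = split["train"], split["test"]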

Audio Augmentation

The system implements several audio augmentation techniques:

  • Volume change
  • Noise addition
  • Tempo modification
  • Pitch shifting
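
Illustrative versions of these transforms using librosa and NumPy (the repository's augments.py may implement them differently):

import numpy as np
import librosa

def change_volume(waveform: np.ndarray, gain: float = 1.2) -> np.ndarray:
    # Scale amplitude and clip back to the valid [-1, 1] range.
    return np.clip(waveform * gain, -1.0, 1.0)

def add_noise(waveform: np.ndarray, noise_level: float = 0.005) -> np.ndarray:
    # Mix in Gaussian noise scaled by noise_level.
    return waveform + noise_level * np.random.randn(len(waveform))

def change_tempo(waveform: np.ndarray, rate: float = 1.1) -> np.ndarray:
    # Speed up (rate > 1) or slow down (rate < 1) without changing pitch.
    return librosa.effects.time_stretch(waveform, rate=rate)

def shift_pitch(waveform: np.ndarray, sr: int = 16000, steps: float = 2.0) -> np.ndarray:
    # Shift pitch by `steps` semitones without changing tempo.
    return librosa.effects.pitch_shift(waveform, sr=sr, n_steps=steps)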

Acknowledgements

TODO

  • Add LibTorch backend support
  • Add TensorRT backend support
  • Add CUDA optimizations for preprocessing
  • Add CUDA optimizations for postprocessing
  • Add batch support

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
