Enkidu: Universal Frequential Perturbation for Real-Time Audio Privacy Protection against Voice Deepfakes
Enkidu is an open-source implementation of the framework described in
"Enkidu: Universal Frequential Perturbation for Real-Time Audio Privacy Protection against Voice Deepfakes" (ACM MM 2025).
The framework generates lightweight and imperceptible Universal Frequential Perturbations (UFPs) in the frequency domain, which can be attached to user audio to protect against voice cloning and deepfake attacks in real time.
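The core operation, tiling a fixed frequency-domain perturbation over audio of arbitrary length, can be sketched as follows. This is a minimal illustration built on `torch.stft`/`torch.istft`, not the repository's actual `add_noise` implementation; the tensor shapes, FFT size, and tiling scheme are assumptions:

```python
import torch

def apply_frequential_perturbation(waveform, noise_real, noise_imag,
                                   n_fft=512, hop_length=128):
    """Add a fixed-length complex perturbation to a waveform's spectrogram.

    waveform:   (channels, samples) float tensor
    noise_real/noise_imag: (channels, n_fft // 2 + 1, frame_length) tensors
    """
    window = torch.hann_window(n_fft)
    # Complex spectrogram, shape (channels, freq_bins, frames)
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                      window=window, return_complex=True)
    frames = spec.shape[-1]
    frame_length = noise_real.shape[-1]
    # Tile the universal perturbation along the time axis to cover all frames
    reps = (frames + frame_length - 1) // frame_length
    noise = torch.complex(noise_real, noise_imag).repeat(1, 1, reps)[..., :frames]
    # Add the perturbation in the frequency domain and invert back to audio
    return torch.istft(spec + noise, n_fft=n_fft, hop_length=hop_length,
                       window=window, length=waveform.shape[-1])
```

Because the perturbation is tiled rather than recomputed, protecting a new utterance costs one STFT, one addition, and one inverse STFT, which is what makes real-time use feasible.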
- Universal Perturbation (UFP): One-time optimization, reusable across arbitrary user audio.
- Real-Time Protection: Efficient frequency-domain perturbation, deployable on CPU/GPU in real time.
- Flexible Training: Combines embedding separation loss and perceptual loss.
- API + CLI: Use as a Python module or as a command-line tool.
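As a rough illustration of how an embedding-separation term and a perceptual term might be combined during training (hypothetical shapes and weighting, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def combined_loss(emb_clean, emb_protected, spec_clean, spec_protected,
                  lambda_perc=0.5):
    """Sketch of a two-term objective for universal noise optimization.

    emb_*:  (batch, embed_dim) speaker embeddings
    spec_*: spectrogram tensors of matching shape
    """
    # Separation term: drive the protected embedding away from the clean one
    # (minimizing cosine similarity pushes the speaker identity apart)
    separation = F.cosine_similarity(emb_clean, emb_protected, dim=-1).mean()
    # Perceptual term: keep the perturbed spectrogram close to the original
    # so the noise stays imperceptible
    perceptual = F.mse_loss(spec_protected, spec_clean)
    return separation + lambda_perc * perceptual
```

The weight `lambda_perc` trades protection strength against audibility of the perturbation; the names here are illustrative only.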
Dependencies:
- Python >= 3.9 (3.10 recommended)
- torch == 2.4.0
- torchaudio == 2.4.0
- speechbrain == 1.0.0
Enkidu/
├── core/ # Core implementation
│ ├── __init__.py
│ └── enkidu.py # Main Enkidu model
│
├── wavdataset/ # Dataset utilities
│ ├── __init__.py
│ └── waveform_privacy_dataset.py
│
├── test_enkidu.py # Example usage (Python API)
├── cli_enkidu.py # Command-line interface
└── README.md

- Python API:
You can call Enkidu directly inside Python scripts:
import torch
import torchaudio
from speechbrain.inference import SpeakerRecognition
from wavdataset import WaveformPrivacyDataset
from core import Enkidu
device = 'cuda:0'
# Load dataset (LibriSpeech example)
wave_dataset = WaveformPrivacyDataset(
dataset_dir='/path/to/LibriSpeech/test-clean',
sample_rate=16000,
mono=True,
wav_format='flac',
)
# Get 40 samples from a single speaker
sample_list = wave_dataset.get_speaker_samples(0)[:40]
# Initialize Enkidu model
enkidu = Enkidu(
model=SpeakerRecognition.from_hparams('speechbrain/spkrec-ecapa-voxceleb', run_opts={"device": device}),
steps=10,
alpha=0.1,
mask_ratio=0.3,
frame_length=30,
noise_level=0.4,
device=device,
)
# Optimize universal noise
noise_real, noise_imag = enkidu(sample_list)
# Protect a new 10-second voice sample
benign_voice = torch.randn(1, 160000) # Simulated 10s waveform
torchaudio.save('benign_voice.wav', benign_voice, 16000)
encrypted_voice = enkidu.add_noise(
benign_voice,
noise_real,
noise_imag,
mask_ratio=0.3,
random_offset=False,
noise_smooth=True,
)
torchaudio.save('encrypted_voice.wav', encrypted_voice, 16000)
print("Encrypted audio saved to encrypted_voice.wav")

- Command-line Interface (CLI):
You can also run Enkidu directly from the command line:
python cli_enkidu.py \
--audios_dir /path/to/train_audios \
--wav_format flac \
--steps 100 \
--alpha 0.1 \
--mask_ratio 0.3 \
--frame_length 30 \
--noise_level 0.4 \
--device cuda:0 \
--input_waveform benign_voice.wav \
--output_waveform encrypted_voice.wav

The CLI takes the following arguments:
Enkidu options (training UFP noise):
- --audios_dir (str, required): Directory of training audios for optimizing the universal noise.
- --sample_rate (int, optional): Target sample rate (resampling is applied if the source rate differs).
- --mono (bool, optional): Convert input to mono if set.
- --wav_format (str, required): Audio format to search for under audios_dir (wav, flac, m4a, ogg, mp3).
- --steps (int, default=40): Number of optimization steps for noise learning.
- --alpha (float, default=0.1): Learning rate for the optimizer.
- --mask_ratio (float, default=0.3): Proportion of frames masked during training (for augmentation).
- --frame_length (int, default=30): Frame length in the STFT domain for tiling perturbations.
- --noise_level (float, default=0.1): Amplitude scaling factor of the noise.
- --noise_smooth (bool, default=False): Apply Wiener filtering to smooth the noise in the frequency domain.
- --device (str, default="cuda:0"): Device to run the model on (cpu, cuda:0, etc.).
Encryption options (applying learned noise):
- --input_waveform (str, required): Path to the input (benign) audio file.
- --output_waveform (str, required): Path to save the encrypted (protected) audio.
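For reference, the option list above maps onto an argparse parser along these lines. This is a sketch inferred from the documented flags and defaults; the actual declarations in cli_enkidu.py may differ:

```python
import argparse

def build_parser():
    # Hypothetical parser mirroring the documented CLI options
    p = argparse.ArgumentParser(
        description="Train a universal frequential perturbation and encrypt audio")
    # Enkidu options (training UFP noise)
    p.add_argument("--audios_dir", type=str, required=True)
    p.add_argument("--sample_rate", type=int, default=16000)
    p.add_argument("--mono", action="store_true")
    p.add_argument("--wav_format", type=str, required=True,
                   choices=["wav", "flac", "m4a", "ogg", "mp3"])
    p.add_argument("--steps", type=int, default=40)
    p.add_argument("--alpha", type=float, default=0.1)
    p.add_argument("--mask_ratio", type=float, default=0.3)
    p.add_argument("--frame_length", type=int, default=30)
    p.add_argument("--noise_level", type=float, default=0.1)
    p.add_argument("--noise_smooth", action="store_true")
    p.add_argument("--device", type=str, default="cuda:0")
    # Encryption options (applying learned noise)
    p.add_argument("--input_waveform", type=str, required=True)
    p.add_argument("--output_waveform", type=str, required=True)
    return p
```

Boolean flags are modeled here as store_true switches, so passing the flag enables the behavior and omitting it keeps the default of False.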
If you find this work useful, please cite our paper:
@inproceedings{feng2025enkidu,
author = {Feng, Zhou and Chen, Jiahao and Zhou, Chunyi and Pu, Yuwen and Li, Qingming and Du, Tianyu and Ji, Shouling},
title = {Enkidu: Universal Frequential Perturbation for Real-Time Audio Privacy Protection against Voice Deepfakes},
year = {2025},
isbn = {9798400720352},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3746027.3755629},
doi = {10.1145/3746027.3755629},
booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia},
pages = {11638–11647},
numpages = {10},
location = {Dublin, Ireland},
series = {MM '25}
}