Bug
When using response_format: "opus" on the non-streaming /v1/audio/speech endpoint, the output audio is consistently truncated by ~1-2 seconds compared to the same text rendered as MP3. The last word(s) get cut off.
Reproduction
Same text, same voice, same speed — only response_format differs:
# MP3 — full audio
curl -s http://localhost:8880/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"input":"Test. The last word in this sentence should be clearly audible, and that word is Constantinople.","voice":"am_puck","model":"kokoro","response_format":"mp3","speed":1.2}' \
--output test_mp3.mp3
# Opus — truncated
curl -s http://localhost:8880/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"input":"Test. The last word in this sentence should be clearly audible, and that word is Constantinople.","voice":"am_puck","model":"kokoro","response_format":"opus","speed":1.2}' \
--output test_opus.ogg
ffprobe -v error -show_entries format=duration -of csv=p=0 test_mp3.mp3 # → 5.016s
ffprobe -v error -show_entries format=duration -of csv=p=0 test_opus.ogg # → 3.000s
Observations
- Opus duration is always a round integer (2.000, 3.000, 5.000, 9.000, 15.000 seconds) — suggests a page/frame boundary issue where the final OGG page isn't being flushed
- The gap does not scale with text length: it stays at roughly 1-2 seconds for every length tested (see the test matrix)
- MP3, WAV, FLAC, PCM all produce full-length audio — only opus is affected
- Voice blending doesn't change the behavior — tested with single voices and blends
- MP3 → ffmpeg → OGG Opus conversion preserves the full duration as a workaround
Test matrix
| Text length | MP3 duration | Opus duration | Lost |
|---|---|---|---|
| Short (1 sentence) | 3.4s | 2.0s | ~1.4s |
| Medium (2 sentences) | 5.0s | 3.0s | ~2.0s |
| Long (4 sentences) | 10.2s | 9.0s | ~1.2s |
| Longer (6 sentences) | 16.4s | 15.0s | ~1.4s |
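The matrix above can be regenerated with a short script. This is a minimal sketch, assuming the same local endpoint and payload as the curl commands and that `ffprobe` is on `PATH`. The helper names (`synthesize`, `probe_duration`, `lost_seconds`) are hypothetical and not part of Kokoro-FastAPI.

```python
import json
import subprocess
import urllib.request

# Assumption: same local endpoint as in the curl reproduction.
SPEECH_URL = "http://localhost:8880/v1/audio/speech"


def synthesize(text: str, response_format: str, out_path: str) -> None:
    """POST the same payload as the curl reproduction and save the audio."""
    payload = {
        "input": text,
        "voice": "am_puck",
        "model": "kokoro",
        "response_format": response_format,
        "speed": 1.2,
    }
    request = urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),  # a body makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())


def probe_duration(path: str) -> float:
    """Return the container duration in seconds as reported by ffprobe."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "csv=p=0", path],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())


def lost_seconds(mp3_duration: float, opus_duration: float) -> float:
    """Audio missing from the Opus output relative to MP3, rounded to 0.1 s."""
    return round(mp3_duration - opus_duration, 1)
```

Calling `synthesize(text, "mp3", "a.mp3")` and `synthesize(text, "opus", "a.ogg")`, then `lost_seconds(probe_duration("a.mp3"), probe_duration("a.ogg"))`, yields one row's "Lost" value.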
Environment
- Kokoro-FastAPI GPU Docker (NVIDIA, PyTorch)
- Linux host
- Tested via direct curl (not streaming)
Likely cause
The OGG Opus encoder isn't flushing the final page. The round-number durations strongly suggest the output is being truncated to a page boundary rather than including a short final page with the remaining audio.
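One way to test this hypothesis is to inspect the Ogg pages of the returned file directly. The sketch below walks the page headers (layout per RFC 3533), reads the pre-skip from the `OpusHead` packet (RFC 7845), and reports whether the last page carries the end-of-stream flag. It assumes a single, unchained logical stream and does not verify page CRCs; `read_pages` and `summarize` are hypothetical helper names. If the server muxes through FFmpeg's Ogg muxer (an assumption, not confirmed here), that muxer's default preferred page duration of 1 second would be consistent with the whole-second durations.

```python
import struct
from typing import List, NamedTuple

OPUS_GRANULE_RATE = 48_000  # Ogg Opus granule positions count 48 kHz samples


class OggPage(NamedTuple):
    header_type: int  # bit 0x04 marks the end-of-stream (EOS) page
    granule: int      # granule position at the end of this page (-1 if none)
    payload: bytes


def read_pages(data: bytes) -> List[OggPage]:
    """Walk Ogg page headers sequentially and return every complete page."""
    pages = []
    offset = 0
    while offset + 27 <= len(data) and data[offset:offset + 4] == b"OggS":
        header_type = data[offset + 5]
        (granule,) = struct.unpack_from("<q", data, offset + 6)
        segment_count = data[offset + 26]
        segment_table = data[offset + 27:offset + 27 + segment_count]
        if len(segment_table) < segment_count:
            break  # header cut off mid-table
        body_start = offset + 27 + segment_count
        body_end = body_start + sum(segment_table)
        if body_end > len(data):
            break  # page body is truncated
        pages.append(OggPage(header_type, granule, data[body_start:body_end]))
        offset = body_end
    return pages


def summarize(data: bytes) -> dict:
    """Report page count, EOS presence, and duration from the last granule."""
    pages = read_pages(data)
    head = pages[0].payload if pages else b""
    if not head.startswith(b"OpusHead") or len(head) < 12:
        raise ValueError("not an Ogg Opus stream")
    (pre_skip,) = struct.unpack_from("<H", head, 10)
    # Use the last page that actually carries a granule position.
    last_granule = next((p.granule for p in reversed(pages) if p.granule >= 0), 0)
    return {
        "pages": len(pages),
        "has_eos": bool(pages[-1].header_type & 0x04),
        "duration_s": max(0.0, (last_granule - pre_skip) / OPUS_GRANULE_RATE),
    }
```

Running `summarize(open("test_opus.ogg", "rb").read())` on the truncated file and on the ffmpeg-converted file should show whether the server's output is missing its EOS page and how its final granule position compares.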
Workaround
Generate as MP3, convert to OGG Opus with ffmpeg:
ffmpeg -i output.mp3 -c:a libopus -b:a 64k output.ogg
This preserves the full duration.
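For scripted use, the same conversion can be wrapped in a small helper. This is a sketch only, assuming `ffmpeg` with libopus support is on `PATH`; the function names are hypothetical, and the `-y` flag (overwrite existing output) is an addition for non-interactive runs that is not in the one-liner.

```python
import subprocess
from typing import List


def build_ffmpeg_command(src: str, dst: str, bitrate: str = "64k") -> List[str]:
    """Build the MP3 to Ogg Opus conversion command used as the workaround."""
    # -y is added here so repeated runs overwrite dst without prompting.
    return ["ffmpeg", "-y", "-i", src, "-c:a", "libopus", "-b:a", bitrate, dst]


def convert_to_opus(src: str, dst: str, bitrate: str = "64k") -> None:
    """Run ffmpeg and raise CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(src, dst, bitrate), check=True)
```

`convert_to_opus("output.mp3", "output.ogg")` then performs the same conversion as the command line.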