Opus response_format truncates audio — final 1-2 seconds lost (OGG page flush issue?) #447

@will-assistant

Bug

When using response_format: "opus" on the non-streaming /v1/audio/speech endpoint, the output audio is consistently truncated by ~1-2 seconds compared to the same text rendered as MP3. The last word(s) get cut off.

Reproduction

Same text, same voice, same speed — only response_format differs:

# MP3 — full audio
curl -s http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input":"Test. The last word in this sentence should be clearly audible, and that word is Constantinople.","voice":"am_puck","model":"kokoro","response_format":"mp3","speed":1.2}' \
  --output test_mp3.mp3

# Opus — truncated
curl -s http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input":"Test. The last word in this sentence should be clearly audible, and that word is Constantinople.","voice":"am_puck","model":"kokoro","response_format":"opus","speed":1.2}' \
  --output test_opus.ogg

ffprobe -v error -show_entries format=duration -of csv=p=0 test_mp3.mp3   # → 5.016s
ffprobe -v error -show_entries format=duration -of csv=p=0 test_opus.ogg  # → 3.000s
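
To repeat this comparison for other inputs, the gap can be computed directly from the two ffprobe readings. This is a minimal sketch that assumes ffprobe and awk are on PATH; duration_of and duration_gap are hypothetical helper names, not part of Kokoro-FastAPI.

```shell
# Hypothetical helpers for comparing output durations.

# Print a media file's container duration in seconds (requires ffprobe).
duration_of() {
  ffprobe -v error -show_entries format=duration -of csv=p=0 "$1"
}

# Print (reference - candidate) in seconds, to three decimals.
duration_gap() {
  LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.3f\n", a - b }'
}

# Usage, with the files produced above:
#   duration_gap "$(duration_of test_mp3.mp3)" "$(duration_of test_opus.ogg)"
```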

Observations

  • Opus duration is always a whole number of seconds (2.000, 3.000, 5.000, 9.000, 15.000), which suggests a page/frame boundary issue where the final OGG page isn't being flushed
  • The gap does not grow with text length; it stays at roughly 1-2 seconds across all tested inputs
  • MP3, WAV, FLAC, PCM all produce full-length audio — only opus is affected
  • Voice blending doesn't change the behavior — tested with single voices and blends
  • MP3 → ffmpeg → OGG Opus conversion preserves the full duration as a workaround

Test matrix

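| Text length          | MP3 duration | Opus duration | Lost  |
|----------------------|--------------|---------------|-------|
| Short (1 sentence)   | 3.4s         | 2.0s          | ~1.4s |
| Medium (2 sentences) | 5.0s         | 3.0s          | ~2.0s |
| Long (4 sentences)   | 10.2s        | 9.0s          | ~1.2s |
| Longer (6 sentences) | 16.4s        | 15.0s         | ~1.4s |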

Environment

  • Kokoro-FastAPI GPU Docker (NVIDIA, PyTorch)
  • Linux host
  • Tested via direct curl (not streaming)

Likely cause

The OGG Opus encoder appears not to be flushing the final page. The whole-second durations strongly suggest the output is being truncated to a page boundary rather than including a short final page with the remaining audio.
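
One way to check this hypothesis is to list the pages in the truncated file and look at the last one. The sketch below walks the Ogg page headers using only od and awk, following the page layout in RFC 3533; ogg_pages is a hypothetical helper name, and the script has not been run against actual Kokoro-FastAPI output.

```shell
# Hypothetical diagnostic: print "offset flags granule" for every Ogg page.
# Header layout (RFC 3533): bytes 0-3 "OggS", byte 5 flags
# (0x02 = first page, 0x04 = last page), bytes 6-13 granule position
# (little-endian), byte 26 segment count, then the segment table.
ogg_pages() {
  file=$1
  size=$(( $(wc -c < "$file") ))
  off=0
  while [ "$off" -lt "$size" ]; do
    # Read the fixed 27-byte header as unsigned decimal bytes.
    set -- $(od -An -v -tu1 -j "$off" -N 27 "$file")
    # Stop on a short read or a missing "OggS" capture pattern.
    [ "$#" -eq 27 ] && [ "$1 $2 $3 $4" = "79 103 103 83" ] || break
    flags=$6
    granule=$(( $7 + ($8 << 8) + ($9 << 16) + (${10} << 24) \
      + (${11} << 32) + (${12} << 40) + (${13} << 48) + (${14} << 56) ))
    nseg=${27}
    # The page body length is the sum of the segment table entries.
    body=$(od -An -v -tu1 -j $((off + 27)) -N "$nseg" "$file" |
      awk '{ for (i = 1; i <= NF; i++) s += $i } END { print s + 0 }')
    printf '%d 0x%02x %d\n' "$off" "$flags" "$granule"
    off=$((off + 27 + nseg + body))
  done
}

# Usage: ogg_pages test_opus.ogg | tail -n 1
```

For Ogg Opus, the granule position counts samples at 48 kHz (RFC 7845), so a complete 5.016 s file should end near 5.016 × 48000 ≈ 240768 plus the stream's pre-skip. If the last page listed lacks the 0x04 flag, or its granule position corresponds to only 3.000 s, that would be consistent with a final page that was never written.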

Workaround

Generate as MP3, convert to OGG Opus with ffmpeg:

ffmpeg -i output.mp3 -c:a libopus -b:a 64k output.ogg

This preserves the full duration.
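
The two steps can be wrapped into one helper until the encoder is fixed. This is a sketch under stated assumptions: speech_to_opus is a hypothetical function name, the request body is read from a JSON file that already sets "response_format":"mp3", and the server address matches the reproduction above.

```shell
# Hypothetical wrapper around the workaround; not part of Kokoro-FastAPI.
# Usage: speech_to_opus REQUEST_JSON_FILE OUTPUT.ogg
speech_to_opus() {
  request=$1
  out=$2
  tmp_mp3="$out.tmp.mp3"

  # Ask the server for MP3, which is not affected by the truncation.
  curl -s --fail http://localhost:8880/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d @"$request" \
    --output "$tmp_mp3" || return 1

  # Re-encode to Ogg Opus locally; -y overwrites an existing output file.
  ffmpeg -v error -y -i "$tmp_mp3" -c:a libopus -b:a 64k "$out"
  status=$?

  # Remove the intermediate MP3 whether or not the conversion succeeded.
  rm -f "$tmp_mp3"
  return "$status"
}
```

The extra encode adds latency and a second lossy generation, so this is a stopgap rather than a fix.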
