
Dropped sentences in ASR transcript w/ Parakeet v3/v2 #128

@hamzaq2000

Description


I transcribed this YouTube video (ID GT_sXIUJPUo) with FluidAudio, using both Parakeet v2 and v3:

# Clone repo
git clone https://github.com/FluidInference/FluidAudio.git
cd FluidAudio

# Download and convert to 16 kHz, mono, 16-bit WAV
uvx "yt-dlp[default]" -f bestaudio GT_sXIUJPUo -o "temp_GT_sXIUJPUo.%(ext)s"
ffmpeg -i "temp_GT_sXIUJPUo."* -acodec pcm_s16le -ac 1 -ar 16000 cowen.wav
rm "temp_GT_sXIUJPUo."*

# Build
swift build -c release

# Transcribe
swift run fluidaudio transcribe cowen.wav --model-version v2
swift run fluidaudio transcribe cowen.wav --model-version v3
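To rule out the input conversion as a factor, it can help to confirm that cowen.wav really is 16 kHz, mono, 16-bit PCM before transcribing. ffprobe can report this. The sketch below instead reads the fields straight from the WAV header with od and dd, so it needs no extra tools. wav_format is a hypothetical helper, not part of FluidAudio. It assumes (1) the "fmt " chunk directly follows the 12-byte RIFF/WAVE header, so the channel count, sample rate and bit depth sit at byte offsets 22, 24 and 34, and (2) a little-endian host such as Apple Silicon, because od decodes multi-byte integers in host byte order.

```shell
#!/bin/sh
# Hypothetical helper (not part of FluidAudio): print the channel count,
# sample rate, and bit depth stored in a WAV file's header.
# Assumes the "fmt " chunk starts at byte 12 and a little-endian host,
# because od reads multi-byte integers in host byte order.
wav_format() (
  export LC_ALL=C
  # Bail out if the first chunk is not "fmt "; the offsets below would be wrong.
  chunk_id=$(dd if="$1" bs=1 skip=12 count=4 2>/dev/null)
  if [ "$chunk_id" != "fmt " ]; then
    echo "wav_format: unexpected chunk layout in $1" >&2
    exit 1  # exits the subshell that forms the function body
  fi
  channels=$(od -A n -t u2 -j 22 -N 2 "$1" | tr -d ' \n')
  rate=$(od -A n -t u4 -j 24 -N 4 "$1" | tr -d ' \n')
  bits=$(od -A n -t u2 -j 34 -N 2 "$1" | tr -d ' \n')
  echo "channels=$channels rate=$rate bits=$bits"
)

# Usage: for the ffmpeg command above, expect "channels=1 rate=16000 bits=16".
# wav_format cowen.wav
```

If the header check fails (for example because a different tool wrote extra chunks before "fmt "), ffprobe is the more robust option.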

I also transcribed the same file using parakeet-mlx to compare.

All results are in the attached cowen_results.zip.

I inspected the first ~1.5 pages of each transcript and noticed the following:

  • FluidAudio Parakeet v3 drops many sentences (8 in the inspected section). See fluid_audio_parakeet_v3/cowen_fluidaudio_parakeet_v3_errors.pdf in the zip file.
  • FluidAudio Parakeet v2 also drops sentences, but far fewer (only 1). See fluid_audio_parakeet_v2/cowen_fluidaudio_parakeet_v2_errors.pdf in the zip file.
  • parakeet-mlx (running Parakeet v3) drops no sentences at all. See parakeet_mlx_parakeet_v3/cowen_parakeet_mlx_parakeet_v3 in the zip file.
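The manual comparison above can be roughly automated. The sketch below uses hypothetical helpers (split_sentences and missing_sentences are not part of FluidAudio or parakeet-mlx) that list sentences found in a reference transcript, such as the parakeet-mlx output, but absent from a candidate transcript. It assumes plain-text transcripts and naive sentence splitting on ".", "?" and "!", and it only matches sentences exactly after lowercasing and stripping punctuation, so a sentence that differs by a single word also shows up as "missing" and the output still needs manual review.

```shell
#!/bin/sh
# Hypothetical helper: split a plain-text transcript into one normalized
# sentence per line (lowercased, ASCII letters/digits/spaces only), sorted
# and de-duplicated. Splitting on . ? ! is naive: "3.5" also becomes a break.
split_sentences() (
  export LC_ALL=C
  tr '\n\t' '  ' < "$1" \
    | tr '.?!' '\n\n\n' \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9 ]//g; s/ +/ /g; s/^ //; s/ $//' \
    | grep -v '^$' \
    | sort -u
)

# Hypothetical helper: print sentences that appear in the reference
# transcript ($1) but not in the candidate transcript ($2).
# Exact matching only, so near-misses are listed too.
missing_sentences() (
  export LC_ALL=C
  ref_sorted=$(mktemp) || exit 1
  cand_sorted=$(mktemp) || exit 1
  split_sentences "$1" > "$ref_sorted"
  split_sentences "$2" > "$cand_sorted"
  # comm -23 keeps lines unique to the first (reference) file.
  comm -23 "$ref_sorted" "$cand_sorted"
  rm -f "$ref_sorted" "$cand_sorted"
)
```

For example, missing_sentences mlx.txt fluidaudio_v3.txt, where both file names are placeholders for plain-text exports of the transcripts. Comparing sorted, de-duplicated sentences ignores their order; when order matters, a word-level diff such as git diff --no-index --word-diff on the two files is an alternative.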

Is this a known issue, or is something about this particular audio file causing the malfunction?

This is unfortunate, because the FluidAudio parakeet-tdt implementations are otherwise the fastest on Apple Silicon, faster than parakeet-mlx. Hopefully it's fixable; I'll try to understand the model architecture and code and see whether I can find the cause myself.

Labels: bug (Something isn't working), speech-to-text (issues related to transcription/asr)