[Inference] fal-ai: audio-to-audio fix + text-to-audio + NeMo ASR & PersonaPlex by jetjodh · Pull Request #2234 · huggingface/huggingface.js

jetjodh · 2026-06-15T19:05:42Z

Summary

Adds and fixes audio model support for the fal-ai inference provider.

Fix audio-to-audio for fal-ai. audioToAudio() now resolves url/headers/signal via makeRequestOptions and forwards them to getResponse (mirroring imageSegmentation), so the fal queue task can poll for results instead of throwing "URL and headers are required for audio-to-audio task". Input data-URL MIME handling reuses FAL_AI_AUDIO_MIME_MAP (consistent with the ASR fix), so audio/wav, audio/webm, etc. aren't rejected by fal's data-URL decoder.
Add text-to-audio support. New queue-based FalAITextToAudioTask (handles both audio_file and audio result shapes), a textToAudio() task function (auto-exposed on InferenceClient via the tasks barrel), a widened TextToAudioTaskHelper.getResponse signature, and export of the (previously defined-but-unexported) text-to-audio inference types from @huggingface/tasks.
ASR: handle NeMo/nemotron output shape + timestamps. fal's nemotron ASR endpoint returns the transcript under output (with a partial flag), not text like whisper — so the existing helper would reject it. getResponse now parses both text and output, and normalizes timestamps from chunks (whisper) or segments to HF's AutomaticSpeechRecognitionOutput.chunks.
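A minimal sketch of the ASR normalization described in the last item, not the PR's actual code. Only the `text`/`output` and `chunks`/`segments` keys come from the description; the item shapes (Whisper-style `timestamp: [start, end]`, segment-style `start`/`end`) and the names `FalAsrResult`, `toChunk`, and `normalizeAsr` are assumptions made for illustration.

```typescript
// Assumed shape of HF's ASR output: a transcript plus optional timestamped chunks.
interface AsrChunk {
	text: string;
	timestamp: [number, number];
}
interface AsrOutput {
	text: string;
	chunks?: AsrChunk[];
}

// Hypothetical fal payload: Whisper-style { text, chunks } or NeMo-style { output, partial, segments }.
interface FalAsrResult {
	text?: unknown;
	output?: unknown;
	partial?: unknown;
	chunks?: unknown;
	segments?: unknown;
}

function isRecord(value: unknown): value is Record<string, unknown> {
	return typeof value === "object" && value !== null;
}

// Convert one timestamped item to an HF-style chunk, or undefined if it is not usable.
function toChunk(item: unknown): AsrChunk | undefined {
	if (!isRecord(item)) return undefined;
	const { text, timestamp, start, end } = item;
	if (typeof text !== "string") return undefined;
	// Assumed Whisper-style item: { text, timestamp: [start, end] }
	if (Array.isArray(timestamp) && timestamp.length === 2) {
		const [from, to] = timestamp;
		if (typeof from === "number" && typeof to === "number") {
			return { text, timestamp: [from, to] };
		}
	}
	// Assumed segment-style item: { text, start, end }
	if (typeof start === "number" && typeof end === "number") {
		return { text, timestamp: [start, end] };
	}
	return undefined;
}

function normalizeAsr(result: FalAsrResult): AsrOutput {
	// Prefer `text` (Whisper); fall back to `output` (NeMo/nemotron).
	const text = typeof result.text === "string" ? result.text : result.output;
	if (typeof text !== "string") {
		throw new Error(`Malformed ASR response: ${JSON.stringify(result)}`);
	}
	const rawItems: unknown[] = Array.isArray(result.chunks)
		? result.chunks
		: Array.isArray(result.segments)
			? result.segments
			: [];
	const chunks = rawItems.map(toChunk).filter((c): c is AsrChunk => c !== undefined);
	return chunks.length > 0 ? { text, chunks } : { text };
}
```

Items that fit neither assumed shape are dropped rather than rejected, so a transcript without usable timestamps still comes back as plain `{ text }`.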

Testing

New unit tests: fal-ai audio-to-audio, text-to-audio, and automatic-speech-recognition (text/output/chunks/segments parsing).
Verified end-to-end against the live fal API: nemotron ASR (correct transcript) and PersonaPlex batch audio-to-audio (audio returned).
tsc --noEmit, eslint, and the new vitest specs all pass.

Note

Medium Risk
Changes the public audioToAudio() request/response path and fal ASR parsing; mistakes could break fal queue audio flows or reject valid ASR payloads, though behavior is covered by new unit tests.

Overview
fal-ai audio-to-audio is wired end-to-end: the task is registered, a queue-based FalAIAudioToAudioTask sends remapped audio_url payloads and polls until it can download returned audio as base64 AudioToAudioOutput[]. audioToAudio() now uses per-provider preparePayloadAsync and passes url/headers/signal into getResponse (same pattern as image segmentation) so queue tasks no longer fail with missing URL/headers.
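To illustrate why the queue task needs `url` and `headers`, here is a rough sketch of a polling loop with an injected fetch function. It is not the code in this PR: `getQueuedResult`, `FetchJson`, `PollOptions`, and the `/requests/<id>/status` URL layout are hypothetical, and the `COMPLETED` status value is assumed to match what the queue reports.

```typescript
type HeaderMap = Record<string, string>;
type FetchJson = (
	url: string,
	init: { headers: HeaderMap; signal?: AbortSignal | undefined }
) => Promise<unknown>;

interface PollOptions {
	intervalMs?: number;
	maxAttempts?: number;
	signal?: AbortSignal | undefined;
}

function sleep(ms: number): Promise<void> {
	return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

function isCompleted(value: unknown): boolean {
	return typeof value === "object" && value !== null && (value as { status?: unknown }).status === "COMPLETED";
}

async function getQueuedResult(
	submission: { request_id: string },
	url: string | undefined,
	headers: HeaderMap | undefined,
	fetchJson: FetchJson,
	options: PollOptions = {}
): Promise<unknown> {
	// Without the request URL and auth headers there is nothing to poll against.
	if (!url || !headers) {
		throw new Error("URL and headers are required for audio-to-audio task");
	}
	const { intervalMs = 500, maxAttempts = 60, signal } = options;
	// Hypothetical layout: status and result endpoints derived from the submit URL.
	const resultUrl = `${url}/requests/${submission.request_id}`;
	const statusUrl = `${resultUrl}/status`;
	for (let attempt = 0; attempt < maxAttempts; attempt++) {
		const status = await fetchJson(statusUrl, { headers, signal });
		if (isCompleted(status)) {
			// The same headers and signal are reused for the final result fetch.
			return fetchJson(resultUrl, { headers, signal });
		}
		await sleep(intervalMs);
	}
	throw new Error(`Queue request ${submission.request_id} did not complete after ${maxAttempts} attempts`);
}
```

Injecting `fetchJson` keeps the loop testable without network access; a caller would pass a thin wrapper around `fetch` that parses JSON.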

ASR on fal-ai accepts transcripts in text (Whisper) or output (NeMo/nemotron) and optionally maps chunks or segments into HF AutomaticSpeechRecognitionOutput.chunks. Input encoding for ASR and audio-to-audio shares buildFalAiAudioDataUrl so browser MIME types (e.g. audio/webm) map to labels fal’s data-URL decoder accepts.
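The MIME handling could look roughly like the sketch below. The entries in `AUDIO_MIME_MAP` are made up for illustration and may not match the real FAL_AI_AUDIO_MIME_MAP, and `buildAudioDataUrl` here is a stand-in for, not a copy of, buildFalAiAudioDataUrl. It assumes a runtime with a global `btoa` (browsers, recent Node).

```typescript
// Hypothetical remap table; the real map may contain different entries.
const AUDIO_MIME_MAP: Record<string, string> = {
	"audio/x-wav": "audio/wav",
	"audio/wave": "audio/wav",
	"audio/mp3": "audio/mpeg",
	"audio/x-m4a": "audio/mp4",
};

function normalizeAudioMime(mime: string): string {
	// Drop parameters such as ";codecs=opus" and normalize case before the lookup.
	const base = (mime.split(";")[0] ?? "").trim().toLowerCase();
	// Unmapped types pass through unchanged in this sketch.
	return AUDIO_MIME_MAP[base] ?? base;
}

function bytesToBase64(bytes: Uint8Array): string {
	let binary = "";
	bytes.forEach((byte) => {
		binary += String.fromCharCode(byte);
	});
	return btoa(binary);
}

function buildAudioDataUrl(bytes: Uint8Array, mime: string): string {
	return `data:${normalizeAudioMime(mime)};base64,${bytesToBase64(bytes)}`;
}

// Example: three bytes labelled with a browser-style MIME type.
console.log(buildAudioDataUrl(new Uint8Array([1, 2, 3]), "audio/x-wav"));
// → data:audio/wav;base64,AQID
```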

AudioToAudioTaskHelper and hf-inference gain preparePayloadAsync; image-to-image fal payload assembly is tightened (parameters + image_url / image_urls without duplicate spreads). New vitest coverage for fal audio-to-audio queue flow and ASR response shapes.

Reviewed by Cursor Bugbot for commit 02cd2e2. Bugbot is set up for automated code reviews on this repo.

Adds an audio-to-audio task handler for the fal-ai provider so that HF partner mappings with that task (e.g. nvidia/personaplex-7b-v1 -> fal-ai/personaplex) can be promoted to live via the partner API.

- AudioToAudioTaskHelper now requires preparePayloadAsync (mirrors the ASR helper) so providers that need to async-encode the input blob can hook in there.
- audioToAudio.ts now calls providerHelper.preparePayloadAsync(args) instead of the sync preparePayload util from audio/utils.ts.
- HFInferenceAudioToAudioTask gets a passthrough preparePayloadAsync that returns { data: Blob, ... }, preserving the existing raw binary body behavior for hf-inference.
- New FalAIAudioToAudioTask extends FalAiQueueTask: preparePayloadAsync validates blob type against FAL_AI_SUPPORTED_BLOB_TYPES and base64-encodes the audio into audio_url: data:audio/...;base64,... for the fal queue payload. getResponse polls the queue, fetches the result audio URL, and returns [{ blob, content-type, label }] where label is the generated transcript when the fal app returns one, else "speech".
- Wires audio-to-audio into the fal-ai entry of PROVIDERS in getProviderHelper.ts.

Made-with: Cursor
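The split between a passthrough and an encoding `preparePayloadAsync` might be sketched as follows. All names ending in `Sketch`, the `BlobLike` interface, and the contents of `SUPPORTED_AUDIO_TYPES` are hypothetical; the real helpers take the library's own argument types and may shape the payload differently.

```typescript
// Minimal structural stand-in for Blob so the sketch runs without DOM types.
interface BlobLike {
	type: string;
	arrayBuffer(): Promise<ArrayBufferLike>;
}

interface AudioToAudioArgsSketch {
	inputs: BlobLike;
	model?: string;
}

interface AudioToAudioHelperSketch {
	preparePayloadAsync(args: AudioToAudioArgsSketch): Promise<Record<string, unknown>>;
}

// Hypothetical allow-list; the real FAL_AI_SUPPORTED_BLOB_TYPES may differ.
const SUPPORTED_AUDIO_TYPES = ["audio/mpeg", "audio/mp4", "audio/wav"];

// hf-inference style: hand the raw binary through untouched under `data`.
const hfInferenceSketch: AudioToAudioHelperSketch = {
	async preparePayloadAsync(args) {
		const { inputs, ...rest } = args;
		return { ...rest, data: inputs };
	},
};

// fal style: validate the type, then inline the audio as a base64 data URL.
const falSketch: AudioToAudioHelperSketch = {
	async preparePayloadAsync(args) {
		const { inputs, ...rest } = args;
		if (SUPPORTED_AUDIO_TYPES.indexOf(inputs.type) === -1) {
			throw new Error(`Unsupported audio type: ${inputs.type}`);
		}
		const bytes = new Uint8Array(await inputs.arrayBuffer());
		let binary = "";
		bytes.forEach((byte) => {
			binary += String.fromCharCode(byte);
		});
		return { ...rest, audio_url: `data:${inputs.type};base64,${btoa(binary)}` };
	},
};
```

Because reading a blob's bytes is asynchronous, the hook has to be async for every provider, which is why the passthrough variant is async as well even though it does no encoding.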

Ensure audio-to-audio resolves providers with the same model and endpoint context as other binary audio tasks, and allow common audio MIME types that fal endpoints can normalize. Made-with: Cursor

…nse + tests The fal-ai audio-to-audio task is a queue task, so getResponse needs url and headers to poll the status/result endpoints. Wire them through audioToAudio() via makeRequestOptions (mirroring imageSegmentation), widen the AudioToAudioTaskHelper.getResponse signature accordingly, and add unit tests covering the queue happy-path, malformed response, and MIME remap. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add FalAITextToAudioTask (queue-based, handles `audio_file`/`audio` result shapes) and register fal-ai for the text-to-audio task. Add a textToAudio() task function — which is auto-exposed on InferenceClient via the tasks barrel — widen TextToAudioTaskHelper.getResponse to forward outputType/signal, and export the text-to-audio inference types from @huggingface/tasks (previously defined but not re-exported). Includes unit tests for the queue happy-path, the `audio` fallback, malformed responses, and prompt/parameter payload mapping. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
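Handling both result shapes could be sketched like this. `extractAudioUrl` is a hypothetical helper, not a function from the PR, and it throws a plain `Error` where the library would use its own error class.

```typescript
function isPlainRecord(value: unknown): value is Record<string, unknown> {
	return typeof value === "object" && value !== null;
}

// Returns the audio URL from either `{ audio_file: { url } }` or `{ audio: { url } }`.
function extractAudioUrl(result: unknown): string {
	if (isPlainRecord(result)) {
		// `audio_file` is checked first; `audio` is the fallback shape.
		for (const key of ["audio_file", "audio"]) {
			const candidate = result[key];
			if (isPlainRecord(candidate)) {
				const url = candidate["url"];
				if (typeof url === "string") return url;
			}
		}
	}
	throw new Error(
		`Received malformed response from Fal.ai text-to-audio API: expected { audio_file: { url: string } } or { audio: { url: string } } result format, got instead: ${JSON.stringify(
			result
		)}`
	);
}
```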

…amps fal's nemotron ASR endpoint (nvidia/nemotron-asr-multilingual/asr) returns the transcript under `output` (with a `partial` flag), not `text` like fal whisper — so the existing helper would reject it. Parse both `text` and `output`, and normalize timestamps from `chunks` (whisper) or `segments` to HF's AutomaticSpeechRecognitionOutput.chunks. Add a dev hardcoded mapping for nvidia/nemotron-3.5-asr-streaming-0.6b -> the fal slug until it's registered for fal-ai on huggingface.co. Verified end-to-end against the live fal API. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

nvidia/personaplex-7b-v1 (pipeline_tag: audio-to-audio) is served by fal's batch endpoint fal-ai/personaplex, which returns { audio: { url }, text } — already handled by FalAIAudioToAudioTask. Add a dev hardcoded mapping until it's registered for fal-ai on huggingface.co. This covers the one-shot speech-to-speech turn; the real-time full-duplex mode is WebSocket-only and out of scope for this HTTP/queue client. Verified end-to-end against the live fal API. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Drop the nemotron / personaplex entries from HARDCODED_MODEL_INFERENCE_MAPPING; these models should be wired up via the HF partner mapping for fal-ai instead of hardcoded dev stopgaps. The provider helpers already handle their request/response shapes, so no code change is needed once the models are registered on huggingface.co. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… blob into the payload `FalAIImageToImageTask.preparePayloadAsync` built a clean payload via `omit(args, ["inputs", "parameters"])`, then spread `...args` again — which re-injected the raw `inputs` Blob and the nested `parameters` object back into the JSON body sent to fal, undoing the omit. Drop the redundant `...args` spread so only the base64 `image_url`/`image_urls` data-URLs and the flattened parameters are sent, matching the sibling image-text-to-image / text-to-video / audio tasks. The return is cast to `RequestArgs` like those siblings, since the payload is keyed by `image_url` rather than a `RequestArgs` union discriminant. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
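The effect of the redundant spread can be reproduced in isolation. This is an illustrative reduction with made-up types: `omitKeys` stands in for the library's `omit`, and `inputs` is a string here rather than a Blob so the result can be printed as JSON.

```typescript
type ImageToImageArgsSketch = {
	inputs: string; // stands in for the raw image Blob
	parameters?: Record<string, unknown>;
	model: string;
};

// Stand-in for the library's omit helper: copy every key except the listed ones.
function omitKeys(obj: Record<string, unknown>, keys: string[]): Record<string, unknown> {
	const out: Record<string, unknown> = {};
	for (const key of Object.keys(obj)) {
		if (keys.indexOf(key) === -1) out[key] = obj[key];
	}
	return out;
}

const args: ImageToImageArgsSketch = {
	inputs: "<raw image blob>",
	parameters: { strength: 0.5 },
	model: "some/model",
};
const imageUrl = "data:image/png;base64,AAAA";

// Buggy: the trailing `...args` puts `inputs` and nested `parameters` back.
const buggyPayload: Record<string, unknown> = {
	...omitKeys(args, ["inputs", "parameters"]),
	...args.parameters,
	image_url: imageUrl,
	...args,
};

// Fixed: only the cleaned args, flattened parameters, and the data URL remain.
const fixedPayload: Record<string, unknown> = {
	...omitKeys(args, ["inputs", "parameters"]),
	...args.parameters,
	image_url: imageUrl,
};

console.log(Object.keys(buggyPayload)); // includes "inputs" and "parameters"
console.log(JSON.stringify(fixedPayload));
// → {"model":"some/model","strength":0.5,"image_url":"data:image/png;base64,AAAA"}
```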

hanouticelina

Thanks @jetjodh! did you test text-to-audio via live testing? Also would it be possible to split the audio-to-audio and text-to-audio support addition into two separate PRs? 🙏

hanouticelina · 2026-06-22T15:11:21Z

+			throw new InferenceClientProviderOutputError(
+				`Received malformed response from Fal.ai text-to-audio API: expected { audio_file: { url: string } } result format, got instead: ${JSON.stringify(
+					result,


Suggested change

-			throw new InferenceClientProviderOutputError(
-				`Received malformed response from Fal.ai text-to-audio API: expected { audio_file: { url: string } } result format, got instead: ${JSON.stringify(
-					result,
+			throw new InferenceClientProviderOutputError(
+				`Received malformed response from Fal.ai text-to-audio API: expected { audio_file: { url: string } } or { audio: { url: string } } result format, got instead: ${JSON.stringify(
+					result,

jetjodh · 2026-06-22T16:24:10Z

> Thanks @jetjodh! did you test text-to-audio via live testing? Also would it be possible to split the audio-to-audio and text-to-audio support addition into two separate PRs? 🙏

can do that

Remove the text-to-audio additions from this PR; they now live in a dedicated PR (huggingface#2249). This PR keeps the audio-to-audio fix and the nemotron ASR `output`/timestamps handling. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jetjodh · 2026-06-22T16:31:36Z

@hanouticelina Done 🙏

Split: moved text-to-audio to [Inference] Add fal-ai text-to-audio support #2249. This PR now covers audio-to-audio
Live testing: yes — text-to-audio and ASR were also live-tested.

jetjodh and others added 6 commits June 15, 2026 10:37

[Inference] Fix fal-ai audio-to-audio request handling

d9169d5

Ensure audio-to-audio resolves providers with the same model and endpoint context as other binary audio tasks, and allow common audio MIME types that fal endpoints can normalize. Made-with: Cursor

jetjodh requested review from SBrandeis, Wauplin, gary149, hanouticelina, julien-c, ngxson and pcuenca as code owners June 15, 2026 19:05

jetjodh and others added 5 commits June 15, 2026 12:11

[Inference] Trim redundant comments in fal-ai audio helpers

6ad8df7

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Merge branch 'main' into jetjodh/review-audio-model-support

fa11482

Merge branch 'main' into jetjodh/review-audio-model-support

2bb5e29

jetjodh force-pushed the jetjodh/review-audio-model-support branch from 36d9840 to a75706c Compare June 18, 2026 00:27

hanouticelina reviewed Jun 22, 2026

jetjodh mentioned this pull request Jun 22, 2026

[Inference] Add fal-ai text-to-audio support #2249

Open

Merge branch 'main' into jetjodh/review-audio-model-support

02cd2e2

jetjodh wants to merge 13 commits into huggingface:main from jetjodh:jetjodh/review-audio-model-support

2 participants
