[Inference] fal-ai: audio-to-audio fix + text-to-audio + NeMo ASR & PersonaPlex #2234

Open
jetjodh wants to merge 13 commits into huggingface:main from jetjodh:jetjodh/review-audio-model-support
jetjodh commented Jun 15, 2026

Summary

Adds and fixes audio model support for the fal-ai inference provider.

  1. Fix audio-to-audio for fal-ai. audioToAudio() now resolves url/headers/signal via makeRequestOptions and forwards them to getResponse (mirroring imageSegmentation), so the fal queue task can poll for results instead of throwing "URL and headers are required for audio-to-audio task". Input data-URL MIME handling reuses FAL_AI_AUDIO_MIME_MAP (consistent with the ASR fix), so audio/wav, audio/webm, etc. aren't rejected by fal's data-URL decoder.

  2. Add text-to-audio support. New queue-based FalAITextToAudioTask (handles both audio_file and audio result shapes), a textToAudio() task function (auto-exposed on InferenceClient via the tasks barrel), a widened TextToAudioTaskHelper.getResponse signature, and export of the (previously defined-but-unexported) text-to-audio inference types from @huggingface/tasks.

  3. ASR: handle NeMo/nemotron output shape + timestamps. fal's nemotron ASR endpoint returns the transcript under output (with a partial flag), not text like whisper — so the existing helper would reject it. getResponse now parses both text and output, and normalizes timestamps from chunks (whisper) or segments to HF's AutomaticSpeechRecognitionOutput.chunks.
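
To make item 3 concrete, here is a minimal, self-contained sketch of the kind of normalization it describes. It is not the code from this PR: the type names, the `toChunk` helper, and the assumed entry layouts (`timestamp: [start, end]` for whisper-style chunks, `start`/`end` for segments) are illustrative assumptions. Only the `text`, `output`, `chunks`, and `segments` field names come from the description above.

```typescript
// Hypothetical output types, modelled loosely on a { text, chunks } ASR result.
interface AsrChunk {
  text: string;
  timestamp: [number, number];
}

interface AsrOutput {
  text: string;
  chunks?: AsrChunk[];
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Assumed entry layouts (not confirmed by the PR page):
//   whisper-style chunk: { text, timestamp: [start, end] }
//   segment-style entry: { text, start, end }
function toChunk(entry: unknown): AsrChunk | undefined {
  if (!isRecord(entry)) return undefined;
  const text = entry.text;
  if (typeof text !== "string") return undefined;
  const timestamp = entry.timestamp;
  if (Array.isArray(timestamp)) {
    const start: unknown = timestamp[0];
    const end: unknown = timestamp[1];
    if (typeof start === "number" && typeof end === "number") {
      return { text, timestamp: [start, end] };
    }
  }
  const start = entry.start;
  const end = entry.end;
  if (typeof start === "number" && typeof end === "number") {
    return { text, timestamp: [start, end] };
  }
  return undefined;
}

function normalizeAsrResponse(response: unknown): AsrOutput {
  if (!isRecord(response)) {
    throw new Error("Malformed ASR response: expected an object");
  }
  // Accept the transcript under `text` (whisper) or `output` (nemotron).
  const text = response.text;
  const output = response.output;
  const transcript =
    typeof text === "string" ? text : typeof output === "string" ? output : undefined;
  if (transcript === undefined) {
    throw new Error("Malformed ASR response: expected `text` or `output` to be a string");
  }
  // Timestamps may arrive under `chunks` or `segments`; both are optional.
  const raw = Array.isArray(response.chunks)
    ? response.chunks
    : Array.isArray(response.segments)
      ? response.segments
      : [];
  const chunks: AsrChunk[] = [];
  for (const entry of raw) {
    const chunk = toChunk(entry);
    if (chunk !== undefined) chunks.push(chunk);
  }
  return chunks.length > 0 ? { text: transcript, chunks } : { text: transcript };
}
```

Entries that match neither assumed layout are skipped rather than rejected, so a response with an unexpected timestamp format still yields the transcript.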

Testing

  • New unit tests: fal-ai audio-to-audio, text-to-audio, and automatic-speech-recognition (text/output/chunks/segments parsing).
  • Verified end-to-end against the live fal API: nemotron ASR (correct transcript) and PersonaPlex batch audio-to-audio (audio returned).
  • tsc --noEmit, eslint, and the new vitest specs all pass.

Note

Medium Risk
Changes the public audioToAudio() request/response path and fal ASR parsing; mistakes could break fal queue audio flows or reject valid ASR payloads, though behavior is covered by new unit tests.

Overview
fal-ai audio-to-audio is wired end-to-end: the task is registered, a queue-based FalAIAudioToAudioTask sends remapped audio_url payloads and polls until it can download returned audio as base64 AudioToAudioOutput[]. audioToAudio() now uses per-provider preparePayloadAsync and passes url/headers/signal into getResponse (same pattern as image segmentation) so queue tasks no longer fail with missing URL/headers.

ASR on fal-ai accepts transcripts in text (Whisper) or output (NeMo/nemotron) and optionally maps chunks or segments into HF AutomaticSpeechRecognitionOutput.chunks. Input encoding for ASR and audio-to-audio shares buildFalAiAudioDataUrl so browser MIME types (e.g. audio/webm) map to labels fal’s data-URL decoder accepts.

AudioToAudioTaskHelper and hf-inference gain preparePayloadAsync; image-to-image fal payload assembly is tightened (parameters + image_url / image_urls without duplicate spreads). New vitest coverage for fal audio-to-audio queue flow and ASR response shapes.

Reviewed by Cursor Bugbot for commit 02cd2e2.

jetjodh and others added 6 commits June 15, 2026 10:37
Adds an audio-to-audio task handler for the fal-ai provider so that HF partner mappings with that task (e.g. nvidia/personaplex-7b-v1 -> fal-ai/personaplex) can be promoted to live via the partner API.

- AudioToAudioTaskHelper now requires preparePayloadAsync (mirrors the ASR helper) so providers that need to async-encode the input blob can hook in there.
- audioToAudio.ts now calls providerHelper.preparePayloadAsync(args) instead of the sync preparePayload util from audio/utils.ts.
- HFInferenceAudioToAudioTask gets a passthrough preparePayloadAsync that returns { data: Blob, ... }, preserving the existing raw binary body behavior for hf-inference.
- New FalAIAudioToAudioTask extends FalAiQueueTask: preparePayloadAsync validates blob type against FAL_AI_SUPPORTED_BLOB_TYPES and base64-encodes the audio into audio_url: data:audio/...;base64,... for the fal queue payload. getResponse polls the queue, fetches the result audio URL, and returns [{ blob, content-type, label }] where label is the generated transcript when the fal app returns one, else "speech".
- Wires audio-to-audio into the fal-ai entry of PROVIDERS in getProviderHelper.ts.
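
The data-URL encoding step described in the bullets above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's `preparePayloadAsync`: the alias table is a made-up stand-in (the real contents of FAL_AI_AUDIO_MIME_MAP and FAL_AI_SUPPORTED_BLOB_TYPES are not shown in this conversation), and the helper names are hypothetical. The sketch takes raw bytes instead of a Blob so it stays synchronous.

```typescript
// Illustrative alias table only: every entry here is an assumption, not the
// actual FAL_AI_AUDIO_MIME_MAP from the PR.
const EXAMPLE_AUDIO_MIME_MAP: Record<string, string> = {
  "audio/x-wav": "audio/wav",
  "audio/wave": "audio/wav",
  "audio/mp3": "audio/mpeg",
};

function normalizeAudioMime(mimeType: string): string {
  // Drop parameters such as ";codecs=opus" that browsers append to recordings.
  const base = (mimeType.split(";")[0] ?? "").trim().toLowerCase();
  return EXAMPLE_AUDIO_MIME_MAP[base] ?? base;
}

function toBase64(bytes: Uint8Array): string {
  let binary = "";
  bytes.forEach((byte) => {
    binary += String.fromCharCode(byte);
  });
  // btoa is a global in browsers and in recent Node.js versions.
  return btoa(binary);
}

function buildAudioDataUrl(bytes: Uint8Array, mimeType: string): string {
  const mime = normalizeAudioMime(mimeType);
  if (mime.indexOf("audio/") !== 0) {
    throw new Error(`Unsupported audio MIME type: ${mimeType}`);
  }
  return `data:${mime};base64,${toBase64(bytes)}`;
}

// The queue payload then carries the data URL under `audio_url`.
const examplePayload = {
  audio_url: buildAudioDataUrl(new Uint8Array([104, 105]), "audio/x-wav"),
};
```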

Made-with: Cursor
Ensure audio-to-audio resolves providers with the same model and endpoint context as other binary audio tasks, and allow common audio MIME types that fal endpoints can normalize.

Made-with: Cursor
…nse + tests

The fal-ai audio-to-audio task is a queue task, so getResponse needs url and
headers to poll the status/result endpoints. Wire them through audioToAudio()
via makeRequestOptions (mirroring imageSegmentation), widen the
AudioToAudioTaskHelper.getResponse signature accordingly, and add unit tests
covering the queue happy-path, malformed response, and MIME remap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
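
For readers unfamiliar with why a queue task needs url and headers in getResponse, the following is a rough sketch of a polling loop. It is not the FalAiQueueTask implementation: the `/requests/<id>/status` layout, the "COMPLETED" status value, the `FetchJson` type, and the `pollQueueResult` name are all assumptions made for illustration. The fetcher is injected so the sketch runs without network access.

```typescript
type HeaderMap = Record<string, string>;

// Hypothetical injected fetcher: resolves to the parsed JSON body for a URL.
type FetchJson = (
  url: string,
  init: { headers: HeaderMap; signal?: AbortSignal }
) => Promise<unknown>;

interface PollOptions {
  signal?: AbortSignal;
  intervalMs?: number;
  maxAttempts?: number;
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function pollQueueResult(
  requestId: string,
  url: string | undefined,
  headers: HeaderMap | undefined,
  fetchJson: FetchJson,
  options: PollOptions = {}
): Promise<unknown> {
  // Without url/headers there is nothing to poll: this is the failure mode the
  // commit above removes by forwarding both from the task function.
  if (!url || !headers) {
    throw new Error("URL and headers are required for audio-to-audio task");
  }
  const { signal, intervalMs = 500, maxAttempts = 60 } = options;
  // Assumed URL layout; the real status/result endpoints may differ.
  const base = `${url.replace(/\/$/, "")}/requests/${requestId}`;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await fetchJson(`${base}/status`, { headers, signal });
    const state =
      typeof status === "object" && status !== null
        ? (status as Record<string, unknown>).status
        : undefined;
    if (state === "COMPLETED") {
      // Same auth headers are reused to fetch the final result.
      return fetchJson(base, { headers, signal });
    }
    await sleep(intervalMs);
  }
  throw new Error(`Queue request ${requestId} did not complete after ${maxAttempts} attempts`);
}
```

Passing the same `signal` to every request lets a caller abort both the status polling and the final result fetch with one AbortController.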
Add FalAITextToAudioTask (queue-based, handles `audio_file`/`audio` result
shapes) and register fal-ai for the text-to-audio task. Add a textToAudio()
task function — which is auto-exposed on InferenceClient via the tasks barrel —
widen TextToAudioTaskHelper.getResponse to forward outputType/signal, and export
the text-to-audio inference types from @huggingface/tasks (previously defined
but not re-exported). Includes unit tests for the queue happy-path, the `audio`
fallback, malformed responses, and prompt/parameter payload mapping.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…amps

fal's nemotron ASR endpoint (nvidia/nemotron-asr-multilingual/asr) returns the
transcript under `output` (with a `partial` flag), not `text` like fal whisper —
so the existing helper would reject it. Parse both `text` and `output`, and
normalize timestamps from `chunks` (whisper) or `segments` to HF's
AutomaticSpeechRecognitionOutput.chunks. Add a dev hardcoded mapping for
nvidia/nemotron-3.5-asr-streaming-0.6b -> the fal slug until it's registered for
fal-ai on huggingface.co. Verified end-to-end against the live fal API.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
nvidia/personaplex-7b-v1 (pipeline_tag: audio-to-audio) is served by fal's batch
endpoint fal-ai/personaplex, which returns { audio: { url }, text } — already
handled by FalAIAudioToAudioTask. Add a dev hardcoded mapping until it's
registered for fal-ai on huggingface.co. This covers the one-shot speech-to-speech
turn; the real-time full-duplex mode is WebSocket-only and out of scope for this
HTTP/queue client. Verified end-to-end against the live fal API.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jetjodh and others added 5 commits June 15, 2026 12:11
Drop the nemotron / personaplex entries from HARDCODED_MODEL_INFERENCE_MAPPING;
these models should be wired up via the HF partner mapping for fal-ai instead of
hardcoded dev stopgaps. The provider helpers already handle their request/response
shapes, so no code change is needed once the models are registered on huggingface.co.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… blob into the payload

`FalAIImageToImageTask.preparePayloadAsync` built a clean payload via
`omit(args, ["inputs", "parameters"])`, then spread `...args` again — which
re-injected the raw `inputs` Blob and the nested `parameters` object back into
the JSON body sent to fal, undoing the omit. Drop the redundant `...args` spread
so only the base64 `image_url`/`image_urls` data-URLs and the flattened
parameters are sent, matching the sibling image-text-to-image / text-to-video /
audio tasks. The return is cast to `RequestArgs` like those siblings, since the
payload is keyed by `image_url` rather than a `RequestArgs` union discriminant.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
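
The spread bug described in the commit above is easy to reproduce in isolation. The sketch below is a simplified reconstruction under assumptions, not the actual FalAIImageToImageTask code: `ExampleArgs`, the local `omit`, and both builder functions are hypothetical, and a Uint8Array stands in for the input Blob.

```typescript
// Hypothetical argument shape; the real RequestArgs type is richer.
interface ExampleArgs {
  [key: string]: unknown;
  inputs: Uint8Array;
  parameters?: Record<string, unknown>;
  model: string;
}

// Minimal local stand-in for an omit() utility.
function omit(obj: Record<string, unknown>, keys: string[]): Record<string, unknown> {
  const copy: Record<string, unknown> = {};
  for (const key of Object.keys(obj)) {
    if (keys.indexOf(key) === -1) copy[key] = obj[key];
  }
  return copy;
}

function buildPayloadBuggy(args: ExampleArgs, imageUrl: string): Record<string, unknown> {
  return {
    ...omit(args, ["inputs", "parameters"]),
    ...args.parameters,
    image_url: imageUrl,
    ...args, // re-adds `inputs` and `parameters`, undoing the omit above
  };
}

function buildPayloadFixed(args: ExampleArgs, imageUrl: string): Record<string, unknown> {
  return {
    ...omit(args, ["inputs", "parameters"]),
    ...args.parameters, // flattened parameters only
    image_url: imageUrl,
  };
}

const exampleArgs: ExampleArgs = {
  inputs: new Uint8Array([1]),
  parameters: { strength: 0.5 },
  model: "example/model",
};
const buggyPayload = buildPayloadBuggy(exampleArgs, "data:image/png;base64,AQ==");
const fixedPayload = buildPayloadFixed(exampleArgs, "data:image/png;base64,AQ==");
```

With these definitions, `buggyPayload` still contains the raw `inputs` and nested `parameters` keys, while `fixedPayload` contains only `model`, `strength`, and `image_url`.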
jetjodh force-pushed the jetjodh/review-audio-model-support branch from 36d9840 to a75706c on June 18, 2026 00:27

hanouticelina (Contributor) left a comment

Thanks @jetjodh! did you test text-to-audio via live testing? Also would it be possible to split the audio-to-audio and text-to-audio support addition into two separate PRs? 🙏

Comment on lines +642 to +644
throw new InferenceClientProviderOutputError(
`Received malformed response from Fal.ai text-to-audio API: expected { audio_file: { url: string } } result format, got instead: ${JSON.stringify(
result,

Suggested change
- throw new InferenceClientProviderOutputError(
- `Received malformed response from Fal.ai text-to-audio API: expected { audio_file: { url: string } } result format, got instead: ${JSON.stringify(
- result,
+ throw new InferenceClientProviderOutputError(
+ `Received malformed response from Fal.ai text-to-audio API: expected { audio_file: { url: string } } or { audio: { url: string } } result format, got instead: ${JSON.stringify(
+ result,
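
As context for the suggestion above, a handler that accepts both result shapes needs an error message that names both. A minimal sketch, assuming a hypothetical `extractAudioUrl` helper and a plain Error in place of InferenceClientProviderOutputError:

```typescript
// Hypothetical helper: returns the audio URL from either accepted result shape.
function extractAudioUrl(result: unknown): string {
  if (typeof result === "object" && result !== null) {
    const record = result as Record<string, unknown>;
    // Prefer `audio_file`, then fall back to `audio`.
    for (const key of ["audio_file", "audio"]) {
      const candidate = record[key];
      if (typeof candidate === "object" && candidate !== null) {
        const url = (candidate as Record<string, unknown>).url;
        if (typeof url === "string") return url;
      }
    }
  }
  // The message lists both shapes so it matches what the code accepts.
  throw new Error(
    `Received malformed response: expected { audio_file: { url: string } } or { audio: { url: string } } result format, got instead: ${JSON.stringify(
      result
    )}`
  );
}
```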

jetjodh (Author) commented Jun 22, 2026

> Thanks @jetjodh! did you test text-to-audio via live testing? Also would it be possible to split the audio-to-audio and text-to-audio support addition into two separate PRs? 🙏

can do that

Remove the text-to-audio additions from this PR; they now live in a dedicated
PR (huggingface#2249). This PR keeps the audio-to-audio fix and
the nemotron ASR `output`/timestamps handling.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jetjodh (Author) commented Jun 22, 2026

@hanouticelina Done 🙏
