
Conversation

@gorkachea (Contributor)

docs(granite-speech): add comprehensive usage examples

Resolves the TODO (@alex-jw-brooks) by adding complete usage documentation for Granite Speech model now that it's released and compatible with transformers.

Added examples for:

  • Basic speech transcription
  • Speech-to-text with additional context
  • Batch processing multiple audio files
  • Tips for best results (audio format, LoRA adapter, memory optimization)

This helps users get started with the Granite Speech multimodal model by providing practical, copy-paste-ready code examples for common use cases.

Replaces TODO comment on line 44 with ~100 lines of comprehensive documentation following the patterns used in other multimodal model docs.

What does this PR do?

This PR resolves an existing TODO in the Granite Speech documentation by adding comprehensive usage examples now that the model is released and compatible with transformers.

Problem: The documentation contained a placeholder TODO comment asking for usage examples once the model was released. Since ibm-granite/granite-3.2-8b-speech is now available, users need practical examples to get started.

Solution: Added a complete "Usage example" section with three practical code examples and a tips section:

  1. Basic speech transcription - simple audio-to-text
  2. Speech-to-text with additional context - using text prompts with audio
  3. Batch processing - efficient handling of multiple audio files
  4. Tips section - audio format, LoRA adapter info, optimization guidance

Impact: Users can now quickly get started with Granite Speech without searching external resources. The examples follow the same structure as other multimodal models (SAM2, SmolVLM, LLaVA) for consistency.
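To make the batch-processing pattern from example 3 concrete, here is a minimal, framework-free sketch of the chunking step that precedes any processor call (the `chunk_files` helper is hypothetical and not part of the PR's documentation):

```python
def chunk_files(paths, batch_size):
    """Split a list of audio file paths into fixed-size batches.

    Each batch can then be passed to the processor in a single call,
    trading memory for throughput.
    """
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]

batches = chunk_files(["a.wav", "b.wav", "c.wav", "d.wav", "e.wav"], batch_size=2)
# batches -> [["a.wav", "b.wav"], ["c.wav", "d.wav"], ["e.wav"]]
```

The last batch may be smaller than `batch_size`; the processor handles ragged final batches the same as full ones, so no padding of the file list is needed.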

Before submitting

  • This PR fixes a typo or improves the docs - Yes, this is a documentation enhancement
  • Did you read the contributor guideline, Pull Request section? Yes
  • Was this discussed/approved via a GitHub issue or the forum? No - this directly resolves an existing TODO comment left by @alex-jw-brooks in the documentation
  • Did you make sure to update the documentation with your changes? Yes - this PR is entirely documentation
  • Did you write any new necessary tests? N/A - documentation-only change, no code changes

Who can review?

@alex-jw-brooks (original TODO author)
@stevhliu (documentation)
@zucchini-nlp (multimodal models)


Additional Context:

This PR adds ~105 lines of documentation to replace a single TODO comment. All code examples:

  • Follow transformers conventions
  • Are syntactically correct
  • Match patterns used in other multimodal model docs
  • Cover the most common use cases for this model

The Granite Speech model is a multimodal speech-to-text model from IBM that's now fully integrated into transformers, so users need these examples to get started effectively.

@zucchini-nlp (Member) left a comment:

cc @eustlb for audio

Comment on lines 90 to 92
# Prepare inputs with text prompt
text_prompt = "Transcribe the following audio:"
audio_input = "path/to/audio.wav"
Member:

prob we need to apply a chat template and format the prompt
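For context on this suggestion: chat-template formatting means building a role-tagged message list and letting the tokenizer render it into the model's expected prompt format, rather than concatenating raw strings. A toy sketch of the idea (the `<|audio|>` placeholder and the `render` helper are illustrative assumptions, not the Granite Speech API — real code would call `processor.tokenizer.apply_chat_template`):

```python
# Role-tagged messages instead of a bare prompt string.
messages = [
    {"role": "user", "content": "<|audio|>Transcribe the following audio:"},
]

def render(messages):
    """Toy stand-in for apply_chat_template(tokenize=False,
    add_generation_prompt=True): wrap each turn in role markers and
    append the assistant tag so the model knows to start generating."""
    turns = "".join(f"<|{m['role']}|>{m['content']}" for m in messages)
    return turns + "<|assistant|>"

prompt = render(messages)
```

The point of going through the template is that the special tokens and turn structure stay in sync with whatever the model was trained on, instead of being hand-assembled in the docs example.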

Comment on lines 142 to 148
### Tips for Best Results

- **Audio Format**: The model expects 16kHz sampling rate audio. The processor will automatically resample if needed.
- **LoRA Adapter**: The LoRA adapter is automatically enabled when audio features are present, so you don't need to manage it manually.
- **Memory Usage**: For large models, use `torch.bfloat16` or quantization to reduce memory footprint.
- **Temperature**: Use lower temperatures (0.1-0.5) for accurate transcription, higher (0.7-1.0) for more creative responses.
- **Batch Size**: Adjust batch size based on available GPU memory. Larger batches improve throughput but require more memory.
Member:
imo most of the tips are general knowledge about LLMs. If it is really important for GraniteSpeech, we can move it under Usage Tips

…fic tips

- Added proper chat template formatting in the second example (per @zucchini-nlp feedback)
- Removed generic LLM tips (temperature, batch size, memory)
- Moved Granite Speech-specific tips (audio format, LoRA adapter) to Usage tips section

This keeps the documentation focused on model-specific features rather than general LLM knowledge.
@gorkachea (Contributor, Author):

Thanks for the review @zucchini-nlp! I've addressed both points:

  1. Chat template: Updated the second example to properly use apply_chat_template() for formatting the prompt with audio
  2. Generic tips: Removed general LLM advice (temperature, batch size, memory) and moved Granite Speech-specific tips (audio format, LoRA adapter) to the "Usage tips" section at the top

The documentation now focuses on model-specific features. Let me know if you'd like any other adjustments!

cc @eustlb as requested

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp (Member) left a comment:

lgtm!


# Prepare audio input (16kHz sampling rate required)
# audio can be a file path, numpy array, or tensor
audio_input = "path/to/audio.wav"
Contributor:

let's comment this out and use datasets so we have something that works out of the box when copy-pasted:

from datasets import load_dataset, Audio

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
audio = ds['audio'][0]['array']

Comment on lines +98 to +108
text = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = processor(
    text=text,
    audio="path/to/audio.wav",
    return_tensors="pt"
).to(model.device)
Contributor:

why can't we do apply_chat_template with tokenize=True directly? That would be the standard way of doing it
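To illustrate the reviewer's point: with `tokenize=True` the template returns model inputs directly, so the separate render-then-tokenize step collapses into one call. A toy model of the flag's semantics (this class is illustrative only, not the transformers implementation):

```python
class ToyTemplate:
    """Toy model of apply_chat_template's tokenize flag."""

    def apply_chat_template(self, messages, tokenize, add_generation_prompt=False):
        # Render the role-tagged turns into a single prompt string.
        text = "".join(f"<|{m['role']}|>{m['content']}" for m in messages)
        if add_generation_prompt:
            text += "<|assistant|>"
        if tokenize:
            # Stand-in for real token ids: one id per character here.
            return [ord(c) for c in text]
        return text

tmpl = ToyTemplate()
msgs = [{"role": "user", "content": "hi"}]
text = tmpl.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
ids = tmpl.apply_chat_template(msgs, tokenize=True, add_generation_prompt=True)
# tokenize=True hands back the same content, already tokenized,
# skipping the intermediate string round-trip.
```

The two-step form in the diff (render with `tokenize=False`, then feed the string to the processor) is only needed when the processor has to interleave audio features into the text; otherwise the single-call form is the standard pattern.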

