
Conversation

@gorkachea (Contributor)

docs(granite-speech): add comprehensive usage examples

Resolves the TODO (@alex-jw-brooks) by adding complete usage documentation for Granite Speech model now that it's released and compatible with transformers.

Added examples for:

  • Basic speech transcription
  • Speech-to-text with additional context
  • Batch processing multiple audio files
  • Tips for best results (audio format, LoRA adapter, memory optimization)

This helps users get started with the Granite Speech multimodal model by providing practical, copy-paste-ready code examples for common use cases.

Replaces TODO comment on line 44 with ~100 lines of comprehensive documentation following the patterns used in other multimodal model docs.

What does this PR do?

This PR resolves an existing TODO in the Granite Speech documentation by adding comprehensive usage examples now that the model is released and compatible with transformers.

Problem: The documentation contained a placeholder TODO comment asking for usage examples once the model was released. Since ibm-granite/granite-3.2-8b-speech is now available, users need practical examples to get started.

Solution: Added a complete "Usage example" section with three practical code examples and a tips section:

  1. Basic speech transcription - simple audio-to-text
  2. Speech-to-text with additional context - using text prompts with audio
  3. Batch processing - efficient handling of multiple audio files
  4. Tips section - audio format, LoRA adapter info, optimization guidance

Impact: Users can now quickly get started with Granite Speech without searching external resources. The examples follow the same structure as other multimodal models (SAM2, SmolVLM, LLaVA) for consistency.
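To make the batch-processing pattern from example 3 concrete, here is a minimal, framework-free sketch of the chunking step that precedes any processor call (the `chunk_files` helper is hypothetical and not part of the PR's documentation):

```python
def chunk_files(paths, batch_size):
    """Split a list of audio file paths into fixed-size batches.

    Each batch can then be passed to the processor in a single call,
    trading memory for throughput.
    """
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]

batches = chunk_files(["a.wav", "b.wav", "c.wav", "d.wav", "e.wav"], batch_size=2)
# batches -> [["a.wav", "b.wav"], ["c.wav", "d.wav"], ["e.wav"]]
```

The last batch may be smaller than `batch_size`; the processor handles ragged final batches the same as full ones, so no padding of the file list is needed.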

Before submitting

  • This PR fixes a typo or improves the docs - Yes, this is a documentation enhancement
  • Did you read the contributor guideline, Pull Request section? Yes
  • Was this discussed/approved via a GitHub issue or the forum? No - this directly resolves an existing TODO comment left by @alex-jw-brooks in the documentation
  • Did you make sure to update the documentation with your changes? Yes - this PR is entirely documentation
  • Did you write any new necessary tests? N/A - documentation-only change, no code changes

Who can review?

@alex-jw-brooks (original TODO author)
@stevhliu (documentation)
@zucchini-nlp (multimodal models)


Additional Context:

This PR adds ~105 lines of documentation to replace a single TODO comment. All code examples:

  • Follow transformers conventions
  • Are syntactically correct
  • Match patterns used in other multimodal model docs
  • Cover the most common use cases for this model

The Granite Speech model is a multimodal speech-to-text model from IBM that's now fully integrated into transformers, so users need these examples to get started effectively.

@zucchini-nlp (Member) left a comment:

cc @eustlb for audio

Comment on lines 90 to 92
# Prepare inputs with text prompt
text_prompt = "Transcribe the following audio:"
audio_input = "path/to/audio.wav"
Member:

prob we need to apply a chat template and format the prompt
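For context on this suggestion: chat-template formatting means building a role-tagged message list and letting the tokenizer render it into the model's expected prompt format, rather than concatenating raw strings. A toy sketch of the idea (the `<|audio|>` placeholder and the `render` helper are illustrative assumptions, not the Granite Speech API — real code would call `processor.tokenizer.apply_chat_template`):

```python
# Role-tagged messages instead of a bare prompt string.
messages = [
    {"role": "user", "content": "<|audio|>Transcribe the following audio:"},
]

def render(messages):
    """Toy stand-in for apply_chat_template(tokenize=False,
    add_generation_prompt=True): wrap each turn in role markers and
    append the assistant tag so the model knows to start generating."""
    turns = "".join(f"<|{m['role']}|>{m['content']}" for m in messages)
    return turns + "<|assistant|>"

prompt = render(messages)
```

The point of going through the template is that the special tokens and turn structure stay in sync with whatever the model was trained on, instead of being hand-assembled in the docs example.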

Comment on lines 142 to 148
### Tips for Best Results

- **Audio Format**: The model expects 16kHz sampling rate audio. The processor will automatically resample if needed.
- **LoRA Adapter**: The LoRA adapter is automatically enabled when audio features are present, so you don't need to manage it manually.
- **Memory Usage**: For large models, use `torch.bfloat16` or quantization to reduce memory footprint.
- **Temperature**: Use lower temperatures (0.1-0.5) for accurate transcription, higher (0.7-1.0) for more creative responses.
- **Batch Size**: Adjust batch size based on available GPU memory. Larger batches improve throughput but require more memory.
Member:
imo most of the tips are general knowledge about LLMs. If it is really important for GraniteSpeech, we can move it under Usage Tips

…fic tips

- Added proper chat template formatting in the second example (per @zucchini-nlp feedback)
- Removed generic LLM tips (temperature, batch size, memory)
- Moved Granite Speech-specific tips (audio format, LoRA adapter) to Usage tips section

This keeps the documentation focused on model-specific features rather than general LLM knowledge.
@gorkachea (Contributor, Author):

Thanks for the review @zucchini-nlp! I've addressed both points:

  1. Chat template: Updated the second example to properly use apply_chat_template() for formatting the prompt with audio
  2. Generic tips: Removed general LLM advice (temperature, batch size, memory) and moved Granite Speech-specific tips (audio format, LoRA adapter) to the "Usage tips" section at the top

The documentation now focuses on model-specific features. Let me know if you'd like any other adjustments!

cc @eustlb as requested

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp (Member) left a comment:

lgtm!


# Prepare audio input (16kHz sampling rate required)
# audio can be a file path, numpy array, or tensor
audio_input = "path/to/audio.wav"
Contributor:

let's comment this out and use datasets so we have something that works out of the box when copy-pasted:

from datasets import load_dataset, Audio

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
audio = ds['audio'][0]['array']

Comment on lines +98 to +108
text = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = processor(
    text=text,
    audio="path/to/audio.wav",
    return_tensors="pt"
).to(model.device)
Contributor:

why can't we do apply_chat_template with tokenize=True directly? That would be the standard way of doing it
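To illustrate the reviewer's point: with `tokenize=True` the template returns model inputs directly, so the separate render-then-tokenize step collapses into one call. A toy model of the flag's semantics (this class is illustrative only, not the transformers implementation):

```python
class ToyTemplate:
    """Toy model of apply_chat_template's tokenize flag."""

    def apply_chat_template(self, messages, tokenize, add_generation_prompt=False):
        # Render the role-tagged turns into a single prompt string.
        text = "".join(f"<|{m['role']}|>{m['content']}" for m in messages)
        if add_generation_prompt:
            text += "<|assistant|>"
        if tokenize:
            # Stand-in for real token ids: one id per character here.
            return [ord(c) for c in text]
        return text

tmpl = ToyTemplate()
msgs = [{"role": "user", "content": "hi"}]
text = tmpl.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
ids = tmpl.apply_chat_template(msgs, tokenize=True, add_generation_prompt=True)
# tokenize=True hands back the same content, already tokenized,
# skipping the intermediate string round-trip.
```

The two-step form in the diff (render with `tokenize=False`, then feed the string to the processor) is only needed when the processor has to interleave audio features into the text; otherwise the single-call form is the standard pattern.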

