📚 docs(granite-speech): add comprehensive usage examples #42125
base: main
Conversation
Resolves the TODO (@alex-jw-brooks) by adding complete usage documentation for the Granite Speech model now that it's released and compatible with transformers.

Added examples for:
- Basic speech transcription
- Speech-to-text with additional context
- Batch processing multiple audio files
- Tips for best results (audio format, LoRA adapter, memory optimization)

This helps users get started with the Granite Speech multimodal model by providing practical, copy-paste-ready code examples for common use cases. Replaces the TODO comment on line 44 with ~100 lines of comprehensive documentation following the patterns used in other multimodal model docs.
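For reference, a minimal sketch of what the basic transcription example could look like. This is not the exact code added by the PR; the `GraniteSpeechForConditionalGeneration` / `AutoProcessor` class names and the `ibm-granite/granite-3.2-8b-speech` checkpoint (taken from the PR description below) are assumptions:

```python
# Hedged sketch of a basic transcription flow; class and checkpoint names are
# assumptions based on this PR's description, not the exact code it adds.
import torch
from datasets import Audio, load_dataset
from transformers import AutoProcessor, GraniteSpeechForConditionalGeneration

model_id = "ibm-granite/granite-3.2-8b-speech"  # checkpoint named in the PR description
processor = AutoProcessor.from_pretrained(model_id)
model = GraniteSpeechForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Small public sample so the snippet runs without a local audio file
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
audio = ds[0]["audio"]["array"]

inputs = processor(
    text="Transcribe the following audio:",  # plain prompt; the review below suggests a chat template
    audio=audio,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=200)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]  # drop the prompt tokens
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```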
zucchini-nlp
left a comment
cc @eustlb for audio
# Prepare inputs with text prompt
text_prompt = "Transcribe the following audio:"
audio_input = "path/to/audio.wav"
prob we need to apply a chat template and format the prompt
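A hedged sketch of what that could look like, reusing `processor` and `audio_input` from the quoted snippet; the message structure and the `<|audio|>` placeholder are assumptions, not something confirmed in this thread:

```python
# Assumed message format; the audio placeholder and role layout are illustrative only.
messages = [
    {"role": "user", "content": "<|audio|>Can you transcribe the speech into a written format?"},
]
text = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=text, audio=audio_input, return_tensors="pt")
```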
### Tips for Best Results

- **Audio Format**: The model expects 16kHz sampling rate audio. The processor will automatically resample if needed.
- **LoRA Adapter**: The LoRA adapter is automatically enabled when audio features are present, so you don't need to manage it manually.
- **Memory Usage**: For large models, use `torch.bfloat16` or quantization to reduce memory footprint.
- **Temperature**: Use lower temperatures (0.1-0.5) for accurate transcription, higher (0.7-1.0) for more creative responses.
- **Batch Size**: Adjust batch size based on available GPU memory. Larger batches improve throughput but require more memory.
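To make the memory tip above concrete, a minimal sketch of the two options it mentions; the class name and checkpoint are assumptions carried over from the rest of this PR, not part of the quoted docs:

```python
# Hedged sketch of the "Memory Usage" tip: half precision or quantization.
import torch
from transformers import BitsAndBytesConfig, GraniteSpeechForConditionalGeneration

model_id = "ibm-granite/granite-3.2-8b-speech"  # checkpoint named in the PR description

# Option 1: load weights in bfloat16
model = GraniteSpeechForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Option 2: 4-bit quantization via bitsandbytes
model = GraniteSpeechForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
```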
imo most of the tips are general knowledge about LLMs. If it is really important for GraniteSpeech, we can move it under Usage Tips
…fic tips

- Added proper chat template formatting in the second example (per @zucchini-nlp feedback)
- Removed generic LLM tips (temperature, batch size, memory)
- Moved Granite Speech-specific tips (audio format, LoRA adapter) to Usage tips section

This keeps the documentation focused on model-specific features rather than general LLM knowledge.
Thanks for the review @zucchini-nlp! I've addressed both points:

- Added proper chat template formatting in the second example
- Removed the generic LLM tips and moved the Granite Speech-specific ones (audio format, LoRA adapter) to the Usage tips section

The documentation now focuses on model-specific features. Let me know if you'd like any other adjustments! cc @eustlb as requested
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
zucchini-nlp
left a comment
lgtm!
# Prepare audio input (16kHz sampling rate required)
# audio can be a file path, numpy array, or tensor
audio_input = "path/to/audio.wav"
let's comment this out and use datasets so we have something that works out of the box when copy-pasted
from datasets import load_dataset, Audio
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
audio = ds['audio'][0]['array']

text = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = processor(
    text=text,
    audio="path/to/audio.wav",
    return_tensors="pt"
).to(model.device)
why can't we do apply_chat_template with tokenize=True directly? That would be the standard way of doing it
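For context, a sketch of the pattern the reviewer seems to have in mind; whether Granite Speech can skip the intermediate prompt string is an assumption, since the audio features still have to come from the processor:

```python
# Tokenize the chat template directly instead of rendering a string first.
# return_dict=True gives input_ids/attention_mask in one call; the audio side
# would still go through the processor or feature extractor separately.
prompt_inputs = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
```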
docs(granite-speech): add comprehensive usage examples
What does this PR do?
This PR resolves an existing TODO in the Granite Speech documentation by adding comprehensive usage examples now that the model is released and compatible with transformers.
Problem: The documentation contained a placeholder TODO comment asking for usage examples once the model was released. Since ibm-granite/granite-3.2-8b-speech is now available, users need practical examples to get started.

Solution: Added a complete "Usage example" section with three practical code examples:
- Basic speech transcription
- Speech-to-text with additional context
- Batch processing multiple audio files
Impact: Users can now quickly get started with Granite Speech without searching external resources. The examples follow the same structure as other multimodal models (SAM2, SmolVLM, LLaVA) for consistency.
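As a rough idea of the batch-processing example listed above, here is a loop-based sketch; it assumes `processor` and `model` are loaded as in the earlier snippets, the file paths are hypothetical, and true batched calls (a list of audio inputs in a single processor call) are not confirmed here:

```python
# Hedged sketch: process several files sequentially and collect transcriptions.
audio_files = ["sample1.wav", "sample2.wav", "sample3.wav"]  # hypothetical paths

transcriptions = []
for path in audio_files:
    inputs = processor(
        text="Transcribe the following audio:",  # placeholder prompt
        audio=path,
        return_tensors="pt",
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=200)
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]  # keep only generated tokens
    transcriptions.append(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])

print(transcriptions)
```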
Before submitting
Who can review?
@alex-jw-brooks (original TODO author)
@stevhliu (documentation)
@zucchini-nlp (multimodal models)
Additional Context:
This PR adds ~105 lines of documentation to replace a single TODO comment. All code examples are copy-paste ready and follow the patterns used in other multimodal model docs.
The Granite Speech model is a multimodal speech-to-text model from IBM that's now fully integrated into transformers, so users need these examples to get started effectively.