docs/source/en/model_doc/granite_speech.md (105 additions, 1 deletion)

@@ -41,7 +41,111 @@ This model was contributed by [Alexander Brooks](https://huggingface.co/abrooks9

- This model bundles its own LoRA adapter, which will be automatically loaded and enabled/disabled as needed during inference calls. Be sure to install [PEFT](https://github.com/huggingface/peft) to ensure the LoRA is correctly applied!
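
To fail fast when PEFT is missing, a quick import check works — a minimal sketch, assuming only that PEFT ships as the `peft` package:

```python
# Sanity check: the bundled LoRA adapter needs PEFT at inference time.
import importlib.util

if importlib.util.find_spec("peft") is None:
    raise ImportError("Granite Speech's LoRA adapter requires PEFT: pip install peft")
```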

<!-- TODO (@alex-jw-brooks) Add an example here once the model compatible with the transformers implementation is released -->
## Usage example

Granite Speech is a multimodal speech-to-text model that can transcribe audio and respond to text prompts. Here's how to use it:

### Basic Speech Transcription

```python
from transformers import GraniteSpeechForConditionalGeneration, GraniteSpeechProcessor
from datasets import load_dataset, Audio
import torch

# Load model and processor
model = GraniteSpeechForConditionalGeneration.from_pretrained(
"ibm-granite/granite-3.2-8b-speech",
torch_dtype=torch.bfloat16,
device_map="auto"
)
processor = GraniteSpeechProcessor.from_pretrained("ibm-granite/granite-3.2-8b-speech")

# Load a sample audio clip at the 16 kHz sampling rate the model expects
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
audio_input = ds["audio"][0]["array"]
```

> **Contributor:** Let's comment this out and use `datasets`, so we have something that works out of the box when copy-pasted:
>
> ```python
> from datasets import load_dataset, Audio
>
> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
> ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
> audio = ds["audio"][0]["array"]
> ```
>
> **Contributor Author:** Done! All examples now use `hf-internal-testing/librispeech_asr_dummy` and work out of the box.

```python
# Process audio
inputs = processor(audio=audio_input, return_tensors="pt").to(model.device)

# Generate transcription
generated_ids = model.generate(**inputs, max_new_tokens=256)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)
```

### Speech-to-Text with Additional Context

You can provide text context along with audio for more controlled generation:

```python
from transformers import GraniteSpeechForConditionalGeneration, GraniteSpeechProcessor
from datasets import load_dataset, Audio
import torch

model = GraniteSpeechForConditionalGeneration.from_pretrained(
"ibm-granite/granite-3.2-8b-speech",
torch_dtype=torch.bfloat16,
device_map="auto"
)
processor = GraniteSpeechProcessor.from_pretrained("ibm-granite/granite-3.2-8b-speech")

# Prepare a text prompt plus a sample audio clip
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
audio_input = ds["audio"][0]["array"]
text_prompt = "Transcribe the following audio:"
```

> **Member:** We probably need to apply a chat template and format the prompt.

```python

inputs = processor(
text=text_prompt,
audio=audio_input,
return_tensors="pt"
).to(model.device)

# Generate with custom parameters
generated_ids = model.generate(
**inputs,
max_new_tokens=512,
do_sample=True,
temperature=0.7,
top_p=0.9
)

output_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(output_text)
```
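
Per the reviewer's note above, the released checkpoint will likely expect the prompt to be formatted with the tokenizer's chat template rather than passed as raw text. A minimal sketch, assuming the tokenizer ships a chat template and an `<|audio|>` placeholder token — both assumptions about the final release, not confirmed here:

```python
# Hypothetical prompt formatting; the "<|audio|>" placeholder name and the
# chat template contents are assumptions about the released checkpoint.
chat = [{"role": "user", "content": "<|audio|>Transcribe the following audio:"}]
formatted_prompt = processor.tokenizer.apply_chat_template(
    chat, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=formatted_prompt, audio=audio_input, return_tensors="pt").to(model.device)
```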

### Batch Processing

Process multiple audio files efficiently:

```python
from transformers import GraniteSpeechForConditionalGeneration, GraniteSpeechProcessor
from datasets import load_dataset, Audio
import torch

model = GraniteSpeechForConditionalGeneration.from_pretrained(
"ibm-granite/granite-3.2-8b-speech",
torch_dtype=torch.bfloat16,
device_map="auto"
)
processor = GraniteSpeechProcessor.from_pretrained("ibm-granite/granite-3.2-8b-speech")

# Multiple audio clips from the sample dataset
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
audio_inputs = [sample["array"] for sample in ds["audio"][:3]]

# Process the batch
inputs = processor(audio=audio_inputs, return_tensors="pt", padding=True).to(model.device)

# Generate for all inputs
generated_ids = model.generate(**inputs, max_new_tokens=256)
transcriptions = processor.batch_decode(generated_ids, skip_special_tokens=True)

for i, transcription in enumerate(transcriptions):
print(f"Audio {i+1}: {transcription}")
```

### Tips for Best Results

- **Audio Format**: The model expects audio sampled at 16 kHz. The processor will automatically resample if needed; a manual resampling sketch follows this list.
- **LoRA Adapter**: The LoRA adapter is automatically enabled when audio features are present, so you don't need to manage it manually.
- **Memory Usage**: For large models, use `torch.bfloat16` or quantization to reduce the memory footprint; a quantization sketch also follows this list.
- **Temperature**: Use lower temperatures (0.1–0.5) for accurate transcription and higher ones (0.7–1.0) for more creative responses.
- **Batch Size**: Adjust the batch size to the available GPU memory. Larger batches improve throughput but require more memory.
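
If you prefer to resample yourself rather than rely on the processor, here is a minimal sketch using torchaudio — assuming torchaudio is installed and `speech.flac` stands in for your own file; any resampler that yields 16 kHz mono float audio works:

```python
# Resample a local file to the rate the feature extractor expects.
import torchaudio

waveform, sr = torchaudio.load("speech.flac")  # hypothetical local file
target_sr = processor.feature_extractor.sampling_rate  # 16 kHz
if sr != target_sr:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=target_sr)
audio_input = waveform.squeeze(0).numpy()
```

For the memory tip, a sketch of 4-bit loading via `BitsAndBytesConfig` — assuming bitsandbytes is installed and the checkpoint quantizes cleanly:

```python
from transformers import BitsAndBytesConfig, GraniteSpeechForConditionalGeneration
import torch

quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = GraniteSpeechForConditionalGeneration.from_pretrained(
    "ibm-granite/granite-3.2-8b-speech",
    quantization_config=quant_config,
    device_map="auto",
)
```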

> **Member:** IMO most of these tips are general knowledge about LLMs. If any are really important for Granite Speech specifically, we can move them under Usage Tips.

## GraniteSpeechConfig
