> [!IMPORTANT]
> This repository contains the code for our paper on Prompt Projector for LLM-based ASR. Our work builds upon the SLAM-LLM framework, specifically extending the ASR example, and is based on a cloned version of the original repo.
Quick Links: Extended Results · Scripts · Installation · License
LLM-based automatic speech recognition (ASR), a well-established approach, connects speech foundation models to large language models (LLMs) through a speech-to-LLM projector, yielding promising results. A common design choice in these architectures is the use of a fixed, manually defined prompt during both training and inference. Ideally, this setup would not only remain applicable across a range of practical scenarios but also maximize model performance; in practice, however, the impact of prompt design remains underexplored. This paper presents a comprehensive analysis of commonly used prompts across diverse datasets, showing that prompt choice significantly affects ASR performance and introduces instability, with no single prompt performing best across all cases. Inspired by the speech-to-LLM projector, we propose a prompt projector module: a simple, model-agnostic extension that learns to project prompt embeddings to more effective regions of the LLM input space, without modifying the underlying LLM-based ASR model. Experiments on four datasets show that adding a prompt projector consistently improves performance, reduces variability, and outperforms the best manually selected prompts.
> [!IMPORTANT]
> Complementary experimental results referenced in the paper are consolidated in `appendix.pdf`.
> The appendix includes: a comparison of Vicuna-7B vs. LLaMA3-8B performance in the LLM-based ASR setup, an analysis of the impact of k (the number of learnable embeddings) in prompt-tuning experiments, and a series of experiments examining the importance of freezing (or not freezing) the underlying model while training the Prompt Projector module.
Our prompt projector is implemented as a PEFT (Parameter-Efficient Fine-Tuning) mechanism by modifying the original HuggingFace PEFT library. We cloned and extended the PEFT repository to support the prompt projector as a new PEFT method. The peft/ directory in this repository contains our modified version with the prompt projector implementation.
We extended the original SLAM-LLM codebase with the following enhancements:
- Flexible `<speech>` token placement: Unlike the original implementation, where speech was always prepended to the prompt, our version allows the `<speech>` token to be placed anywhere in the prompt template.
- Learnable token insertion with `<p:N>`: Insert N learnable tokens at any position in the prompt using the `<p:N>` syntax (e.g., `<p:5>` inserts 5 learnable tokens).
- Text-initialized learnable tokens with `<p:TEXT>`: Initialize learnable tokens with user-provided text using the `<p:TEXT>` syntax (e.g., `<p:Transcribe the following audio>`).
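For illustration, here is a minimal sketch of prompt templates using this syntax; the template strings are hypothetical examples, not prompts taken from our experiments:

```python
# Hypothetical prompt templates illustrating the extended syntax described above.
# Adapt them to your own dataset/training configuration.

# <speech> can now be placed anywhere in the template, not only at the beginning.
prompt_speech_in_middle = "Listen to <speech> and write down what is said."

# <p:5> inserts 5 learnable tokens at that position.
prompt_with_learnable_tokens = "<speech> <p:5> Transcribe the audio above."

# <p:TEXT> inserts learnable tokens initialized from the embeddings of TEXT.
prompt_text_initialized = "<speech> <p:Transcribe the following audio>"
```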
Added the following settings to `train_config`:
- `save_checkpoint_only_at_epoch_end` (bool): Save checkpoints only at the end of each epoch instead of at every checkpoint interval.
- `freeze_projector` (bool): Freeze the projector module during training.
- `use_bf16` (bool): Enable bfloat16 training format.
- `peft_config.peft_method`: Added support for `"p-projector"` as a valid PEFT method.
- `prompt_token` (str): Token used to mark learnable positions in prompts (default: `"<p>"`).
- `prompt_num_virtual_tokens` (int): Number of virtual tokens for the "prefix" method.
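As a rough sketch of how these options fit together (only the option names listed above come from this repository; the plain-dict container and the chosen values are illustrative assumptions, not the actual config objects):

```python
# Illustrative sketch only: setting the new train_config options described above.
train_config = {
    "save_checkpoint_only_at_epoch_end": True,  # skip intermediate checkpoint intervals
    "freeze_projector": False,                  # keep the speech-to-LLM projector trainable
    "use_bf16": True,                           # train in bfloat16
    "prompt_token": "<p>",                      # marker for learnable positions in prompts
    "prompt_num_virtual_tokens": 8,             # used by the "prefix" method only
}

peft_config = {
    "peft_method": "p-projector",               # select the new prompt projector PEFT method
}
```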
We modified the original HuggingFace PEFT library to add:
- `P_PROJECTOR`: A new PEFT type that implements our prompt projector mechanism.
- User-provided embedding initialization: Support for initializing prompt embeddings with custom embeddings via the `virtual_token_embs` parameter in `PromptEncoder`.
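As a minimal sketch of the embedding-initialization path (only `PromptEncoder` and the `virtual_token_embs` parameter are taken from our changes; the model ID, the explicit `PromptEncoderConfig` fields, and the direct construction shown here are assumptions for illustration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PromptEncoder, PromptEncoderConfig, TaskType  # the modified peft/ shipped in this repo

# Embed an initialization text with the LLM's own input embedding table.
llm_id = "lmsys/vicuna-7b-v1.5"  # example LLM
tokenizer = AutoTokenizer.from_pretrained(llm_id)
llm = AutoModelForCausalLM.from_pretrained(llm_id, torch_dtype=torch.bfloat16)

init_text = "Transcribe the following audio"
token_ids = tokenizer(init_text, add_special_tokens=False, return_tensors="pt").input_ids
with torch.no_grad():
    virtual_token_embs = llm.get_input_embeddings()(token_ids).squeeze(0)  # (num_tokens, hidden_dim)

# Size the prompt-encoder config to match the initialization embeddings.
config = PromptEncoderConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=virtual_token_embs.shape[0],
    token_dim=virtual_token_embs.shape[1],
    encoder_hidden_size=virtual_token_embs.shape[1],
    num_transformer_submodules=1,
)

# The modified PromptEncoder accepts user-provided embeddings as its starting point.
prompt_encoder = PromptEncoder(config, virtual_token_embs=virtual_token_embs)
```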
We provide a conda environment file for easy setup. To create and activate the environment:
```bash
conda env create -f environment.yml
conda activate slam_llm
```

> [!TIP]
> For additional installation details, troubleshooting, or alternative setup methods, please refer to the original SLAM-LLM repository.
To reproduce the experiments from our paper:
- First, read the two subsections below to understand the input data format and where to download the required model binaries.
- Then, refer to the `scripts/` folder, which contains all necessary training and evaluation scripts along with a detailed README explaining the complete workflow.
You need to prepare the data jsonl in this format:

```
{"key": "1001-134707-0000_ASR", "source": "/data/open_data/librispeech_audio/audio/librispeech_1001-134707-0000.wav", "target": "1 little recks the laborer. How near his work is holding him to God, The loving laborer through space and time, after all, not to create, only or found only."}
...
{"key": "1001-134707-0000_ASR", "source": "/data/open_data/librispeech_audio/audio/librispeech_1001-134707-0000.wav", "target": "1 little recks the laborer. How near his work is holding him to God, The loving laborer through space and time, after all, not to create, only or found only."}
```
- Speech encoder: WavLM-large
- LLM: vicuna-7b-v1.5
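One way to fetch these binaries is via the Hugging Face Hub (the repo IDs below are the public `microsoft/wavlm-large` and `lmsys/vicuna-7b-v1.5` releases and the local paths are examples; if your scripts expect the original WavLM `.pt` checkpoint, download it from the WavLM release page instead):

```python
from huggingface_hub import snapshot_download

# Sketch: download the speech encoder and LLM weights to local folders.
snapshot_download(repo_id="microsoft/wavlm-large", local_dir="models/wavlm-large")
snapshot_download(repo_id="lmsys/vicuna-7b-v1.5", local_dir="models/vicuna-7b-v1.5")
```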
For more details on the base SLAM-LLM framework, please refer to the README.md of the original repo.
Our original code additions and modifications (including the prompt projector mechanism, enhanced prompt handling, new training configuration options, and PEFT extensions) are licensed under the MIT License. Portions of this repository remain under their original upstream licenses:
- SLAM-LLM original code: MIT License
- HuggingFace components and PEFT upstream code: Apache License 2.0
- Certain integrated logic referencing Llama materials (e.g., adaptation utilities) is subject to the Llama 2 Community License
Please consult individual file headers and the LICENSES/ directory for the full text of each applicable license:
- `LICENSES/MIT.txt`
- `LICENSES/APACHE-2.0.txt`
- `LICENSES/LLAMA2.txt`
Files we modified have a header block that explicitly lists the changes; only those listed modifications are under our MIT license. Everything else in those files keeps its original upstream license. Check each file header and the LICENSES/ directory for exact licensing.