Final project for CPSC 185.
Fine-tuning a Qwen2.5-Omni-3B model to predict the typed text from overhead video (with audio) of typing on a keyboard.
- The Conda environment is given at `environment.yml`. Create the environment with `conda env create -f environment.yml`.
- `0training/collect_gui.py` is a GUI Python program that prompts sentences to type and records video and keystroke data (a minimal sketch of the keystroke-logging idea follows this list).
- Our keyboard video dataset, with full videos and keystroke timing data, is available at this HuggingFace dataset.
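This is not `collect_gui.py` itself, but a minimal sketch of the keystroke-logging half of such a recorder, assuming `pynput` for the key listener; the GUI prompting and video capture are omitted:

```python
import time
from pynput import keyboard

events = []  # (key label, monotonic timestamp) pairs, aligned later with the video clock

def on_press(key):
    # Printable keys are KeyCode objects with a .char; special keys
    # (Key.space, Key.backspace, ...) are recorded by name.
    label = key.char if isinstance(key, keyboard.KeyCode) else str(key)
    events.append((label, time.monotonic()))

listener = keyboard.Listener(on_press=on_press)
listener.start()   # listens on a background thread while the clip records
time.sleep(5)      # stand-in for the per-sentence recording window
listener.stop()
print(events)
```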
## Sample Data
Text: The source said if approved, the authority would allow a transaction to be carried out.
Video: `788.mp4`
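As a sketch of how a clip like this might be fed to the fine-tuned model at inference time, assuming the `transformers` Qwen2.5-Omni API and the `qwen-omni-utils` helper shown on the Qwen2.5-Omni model card; the prompt text and local video path are hypothetical, and a fine-tuned checkpoint directory would replace the base model ID:

```python
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info  # pip install qwen-omni-utils

# Base model ID; a fine-tuned checkpoint directory would go here instead.
CKPT = "Qwen/Qwen2.5-Omni-3B"

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    CKPT, torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained(CKPT)

# One overhead typing clip; the soundtrack carries the key-press audio.
conversation = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "788.mp4"},
        {"type": "text", "text": "What text was typed in this recording?"},
    ],
}]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(
    text=prompt, audio=audios, images=images, videos=videos,
    return_tensors="pt", padding=True, use_audio_in_video=True,
).to(model.device)

# return_audio=False skips Omni's speech synthesis; only the text is needed.
text_ids = model.generate(**inputs, use_audio_in_video=True, return_audio=False)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
```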
## Training

- Install `llamafactory` via `pip install -e ".[torch,metrics]"` in the `1training/LLaMA-Factory` directory.
- Use `1training/0train.ipynb` to generate the augmented dataset and ensure that it ends up at `1training/LLaMA-Factory/data/keyboard_videos`. Look at the relative paths in `keyboard.json` to understand the directory structure for the `.mp4` and `.wav` files (a sketch of the entry format follows this list).
- Run training via `1training/train.sh`, which uses the configuration at `1training/train_keyboard.yml`.
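For orientation, here is a sketch of how the `keyboard.json` entries might be laid out, assuming LLaMA-Factory's sharegpt-style multimodal format (the `messages`/`videos`/`audios` keys mirror its mllm demo datasets); the prompt wording, file stem, and relative paths are illustrative, not the notebook's actual output:

```python
import json

def make_entry(stem: str, transcript: str) -> dict:
    # <video>/<audio> placeholders in the user turn stand in for the clip
    # and its soundtrack; media paths are relative to LLaMA-Factory's data/ dir.
    return {
        "messages": [
            {"role": "user",
             "content": "<video><audio>What text was typed in this recording?"},
            {"role": "assistant", "content": transcript},
        ],
        "videos": [f"keyboard_videos/{stem}.mp4"],
        "audios": [f"keyboard_videos/{stem}.wav"],
    }

entries = [make_entry("788", "The source said if approved, the authority "
                             "would allow a transaction to be carried out.")]
with open("1training/LLaMA-Factory/data/keyboard.json", "w") as f:
    json.dump(entries, f, indent=2)
```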