
Commit 2f22a9e

Image finetuning (#1012)
2 parents 0e00335 + e0e9621

File tree

11 files changed: +1638 -1 lines changed

.github/scripts/spellcheck_conf/wordlist.txt

Lines changed: 18 additions & 0 deletions
@@ -1561,3 +1561,21 @@ VSCode
 applyTo
 mdc
 windsurfrules
+FPFT
+PagedAdamW
+Torchtune
+WandB
+bfclv
+bwd
+cellpadding
+cellspacing
+chartqa
+checkpointer
+docvqa
+enconder
+gsm
+preprocessors
+seqs
+torchtune
+verifications
+Waitlist

getting-started/README.md

Lines changed: 2 additions & 1 deletion
@@ -21,5 +21,6 @@ If you are new to developing with Meta Llama models, this is where you should st
 * The [Build_with_Llama API](./build_with_llama_api.ipynb) notebook highlights some of the features of [Llama API](https://llama.developer.meta.com?utm_source=llama-cookbook&utm_medium=readme&utm_campaign=getting_started).
 * The [inference](./inference/) folder contains scripts to deploy Llama for inference on server and mobile. See also [3p_integrations/vllm](../3p-integrations/vllm/) and [3p_integrations/tgi](../3p-integrations/tgi/) for hosting Llama on open-source model servers.
 * The [RAG](./RAG/) folder contains a simple Retrieval-Augmented Generation application using Llama.
-* The [finetuning](./finetuning/) folder contains resources to help you finetune Llama on your custom datasets, for both single- and multi-GPU setups. The scripts use the native llama-cookbook finetuning code found in [finetuning.py](../src/llama_cookbook/finetuning.py) which supports these features.
+* The [finetuning](./finetuning/) folder contains resources to help you finetune Llama on your custom datasets, for both single- and multi-GPU setups. The scripts use the native llama-cookbook finetuning code found in [finetuning.py](../src/llama_cookbook/finetuning.py) which supports these features:
+  * **NEW:** [Vision fine-tuning recipe](./finetuning/vision/README.md) for Llama 3.2 11B Vision - Learn how to fine-tune multimodal models for document understanding with 98% accuracy on structured data extraction!
 * The [llama-tools](./llama-tools/) folder contains resources to help you use Llama tools, such as [llama-prompt-ops](../llama-tools/llama-prompt-ops_101.ipynb).

getting-started/finetuning/README.md

Lines changed: 3 additions & 0 deletions
@@ -13,6 +13,9 @@ If you are new to fine-tuning techniques, check out [an overview](./LLM_finetuni
 > [!TIP]
 > If you want to try finetuning Meta Llama 3 in a Jupyter notebook you can find a quickstart notebook [here](./quickstart_peft_finetuning.ipynb)

+> [!NOTE]
+> **New: Vision Fine-tuning Recipe** - Looking to fine-tune Llama 3.2 11B Vision for structured data extraction? Check out our comprehensive [vision fine-tuning recipe](./vision/README.md) that achieves 98% accuracy on document understanding tasks using torchtune, with detailed benchmarking comparing LoRA vs Full Parameter Fine-Tuning approaches.
+

 ## How to configure finetuning settings?
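The note above quotes 98% accuracy on structured data extraction. The evaluation script itself is not part of this diff, so purely as an illustration of what a field-level accuracy number like that typically measures (every name below is hypothetical, not the recipe's actual code):

# Hypothetical illustration of field-level extraction accuracy for a
# structured-extraction task such as W-2 key/value fields. This is NOT
# the recipe's evaluation code, which is not included in this commit.
import json

def field_accuracy(predicted_json: str, ground_truth_json: str) -> float:
    """Fraction of ground-truth fields reproduced exactly by the model."""
    try:
        pred = json.loads(predicted_json)
    except json.JSONDecodeError:
        return 0.0  # unparseable output counts as zero correct fields
    truth = json.loads(ground_truth_json)
    if not truth:
        return 1.0
    correct = sum(1 for k, v in truth.items() if str(pred.get(k)) == str(v))
    return correct / len(truth)

print(field_accuracy('{"wages": "55000", "ein": "12-3456789"}',
                     '{"wages": "55000", "ein": "12-3456789"}'))  # -> 1.0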

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+results/
+outputs/
+w2_evaluation_results/
+Llama-3.2-11B-Vision-Instruct/
+fake_w2_us_tax_form_dataset_train30_test70/
+fake_w2_us_tax_form_dataset_train80_test20/
+htmlcov/
Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
# Top-level output directory
output_dir: ./outputs/Llama-3.2-11B-Instruct-w2-full

# Model
model:
  _component_: torchtune.models.llama3_2_vision.llama3_2_vision_11b
  decoder_trainable: False
  encoder_trainable: True
  fusion_trainable: True
  image_size: 560 # Make sure this matches the image_size in tokenizer

# Tokenizer / vision transform
tokenizer:
  _component_: torchtune.models.llama3_2_vision.llama3_2_vision_transform
  path: ./Llama-3.2-11B-Vision-Instruct/original/tokenizer.model
  image_size: 560
  max_seq_len: 8192

# Checkpointing
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: ./Llama-3.2-11B-Vision-Instruct
  checkpoint_files:
    filename_format: model-{}-of-{}.safetensors
    max_filename: "00005"
  recipe_checkpoint: null
  output_dir: ${output_dir}
  model_type: LLAMA3_VISION

resume_from_checkpoint: false
save_adapter_weights_only: False # PeFT formatting not available yet. This will save it in torchtune format only.

# Dataset
dataset:
  _component_: torchtune.datasets.multimodal.vqa_dataset
  source: arrow
  data_files:
    train: "fake_w2_us_tax_form_dataset_train30_test70/train/data-00000-of-00001.arrow"
  split: train
  column_map:
    input: input
    output: ground_truth
    image: image

# General data handling
seed: null
shuffle: true
collate_fn: torchtune.data.padded_collate_tiled_images_and_mask

# Training loop & hyperparams
epochs: 5
max_steps_per_epoch: null
batch_size: 1
gradient_accumulation_steps: 1 # Use to increase effective batch size

# explicit optimizer / scheduler / loss
optimizer:
  _component_: bitsandbytes.optim.PagedAdamW8bit
  lr: 2e-5
optimizer_in_bwd: True # True saves memory. Requires gradient_accumulation_steps=1

loss:
  _component_: torchtune.modules.loss.LinearCrossEntropyLoss

# clip_grad_norm: 1.0
compile: true

# Device & memory
device: cuda
enable_activation_checkpointing: true
dtype: bf16

# Logging
metric_logger:
  _component_: torchtune.training.metric_logging.WandBLogger
  project: llama3_2_w2_extraction
  entity: <your_wandb_entity>
  job_type: full_finetune_single_device
  group: llama-cookbook
log_every_n_steps: 5
save_steps: 100
log_peak_memory_stats: true
log_level: INFO

# Profiler (off by default)
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: false
  output_dir: ${output_dir}/profiling_outputs
  cpu: true
  cuda: true
  profile_memory: false
  with_stack: false
  record_shapes: true
  with_flops: false
  wait_steps: 5
  warmup_steps: 3
  active_steps: 2
  num_cycles: 1
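The `dataset` block in this config points `torchtune.datasets.multimodal.vqa_dataset` at a local Arrow file and remaps its columns to the fields the recipe expects (`input`, `output` mapped from `ground_truth`, and `image`). Judging by the `job_type` in the logging section, the file is meant for torchtune's `full_finetune_single_device` recipe. As a small sanity check before a long run, one might inspect the Arrow file with the Hugging Face `datasets` library; this sketch assumes only the path and column names shown in the config above:

# Illustrative pre-flight check: confirm the Arrow file exposes the columns
# that the config's column_map expects (input, ground_truth, image).
from datasets import Dataset

arrow_path = (
    "fake_w2_us_tax_form_dataset_train30_test70/train/"
    "data-00000-of-00001.arrow"
)

ds = Dataset.from_file(arrow_path)   # memory-maps the Arrow file
print(ds.column_names)               # expect: ['input', 'ground_truth', 'image']

example = ds[0]
print(example["input"][:200])        # prompt text (assumed to be a string column)
print(example["ground_truth"][:200]) # target structured output
print(type(example["image"]))        # PIL image if stored as an Image feature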
Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
# Top-level output directory
output_dir: ./outputs/Llama-3.2-11B-Instruct-w2-lora

# Model + LoRA settings
model:
  _component_: torchtune.models.llama3_2_vision.lora_llama3_2_vision_11b
  # preserve your hyperparams
  lora_rank: 8 # higher increases accuracy and memory
  lora_alpha: 16 # usually alpha=2*rank
  lora_dropout: 0.05
  image_size: 560 # Make sure this matches the image_size in tokenizer
  # example’s fixed settings
  decoder_trainable: "frozen"
  encoder_trainable: "lora"
  fusion_trainable: "lora"
  lora_attn_modules:
    - 'q_proj'
    - 'v_proj'
    - 'output_proj'
  apply_lora_to_mlp: true
  apply_lora_to_output: false

# Tokenizer / vision transform
tokenizer:
  _component_: torchtune.models.llama3_2_vision.llama3_2_vision_transform
  path: ./Llama-3.2-11B-Vision-Instruct/original/tokenizer.model
  image_size: 560
  max_seq_len: 8192

# Checkpointing
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: ./Llama-3.2-11B-Vision-Instruct
  checkpoint_files:
    filename_format: model-{}-of-{}.safetensors
    max_filename: "00005"
  recipe_checkpoint: null
  output_dir: ${output_dir}
  model_type: LLAMA3_VISION

resume_from_checkpoint: false
save_adapter_weights_only: false # PeFT formatting not available yet. This will save it in torchtune format only.

# Dataset
dataset:
  _component_: torchtune.datasets.multimodal.vqa_dataset
  source: arrow
  data_files:
    train: "fake_w2_us_tax_form_dataset_train30_test70/train/data-00000-of-00001.arrow"
  split: train
  column_map:
    input: input
    output: ground_truth
    image: image

# General data handling
seed: null
shuffle: true
collate_fn: torchtune.data.padded_collate_tiled_images_and_mask

# Training loop & hyperparams

# example’s train-control
epochs: 5
max_steps_per_epoch: null
batch_size: 1
gradient_accumulation_steps: 1 # Use to increase effective batch size

# explicit optimizer / scheduler / loss
optimizer:
  _component_: torch.optim.AdamW
  fused: true
  weight_decay: 0.01
  lr: 1e-4
optimizer_in_bwd: true

lr_scheduler:
  _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
  num_warmup_steps: 100

loss:
  _component_: torchtune.modules.loss.LinearCrossEntropyLoss

clip_grad_norm: 1.0
compile: false

# Device & memory
device: cuda
enable_activation_checkpointing: true
dtype: bf16

# Logging
metric_logger:
  _component_: torchtune.training.metric_logging.WandBLogger
  project: llama3_2_w2_extraction
  entity: <your_wandb_entity>
  job_type: lora_finetune_single_device
  group: llama-cookbook
log_every_n_steps: 5
save_steps: 100
log_peak_memory_stats: true
log_level: INFO

# Profiler (off by default)
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: false
  output_dir: ${output_dir}/profiling_outputs
  cpu: true
  cuda: true
  profile_memory: false
  with_stack: false
  record_shapes: true
  with_flops: false
  wait_steps: 5
  warmup_steps: 3
  active_steps: 2
  num_cycles: 1
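For orientation, the `lora_rank: 8`, `lora_alpha: 16`, and `lora_dropout: 0.05` settings above control the usual LoRA decomposition: each adapted projection keeps its frozen pretrained weight and adds a trainable rank-8 update scaled by `alpha / rank`. The sketch below is a generic plain-PyTorch illustration of that idea, not torchtune's implementation; the class name and layer sizes are made up for the example:

# Generic LoRA linear layer illustrating what lora_rank, lora_alpha and
# lora_dropout control in the config above (illustrative sketch only,
# not torchtune's internal implementation).
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16, dropout=0.05):
        super().__init__()
        # Frozen pretrained projection (e.g. q_proj / v_proj / output_proj).
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)
        # Trainable low-rank factors: base(x) + (alpha / rank) * B(A(x)).
        self.lora_a = nn.Linear(in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # adapter starts as a no-op
        self.dropout = nn.Dropout(dropout)
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(self.dropout(x)))

layer = LoRALinear(4096, 4096)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / {total}")  # roughly 0.4% of the layer at rank 8

In the config, `decoder_trainable: "frozen"`, `encoder_trainable: "lora"`, and `fusion_trainable: "lora"` decide which parts of the vision-language model receive adapters of this form, and the `job_type` suggests this file targets torchtune's `lora_finetune_single_device` recipe.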
