load_state() does not actually load optimizer state. The API has been fixed to make loading optimizer state explicit.
This commit migrates load_state() callers to use
load_state_with_optimizer() when appropriate. Next we'll fix callers of create_training_client_from_state().
Signed-off-by: Daniel Xu <[email protected]>
llms-full.txt (+23 −9)
```diff
@@ -607,11 +607,12 @@ We'll start with a couple of general pages that'll be relevant to almost all of
 
 # Saving and loading weights and optimizer state
 
-During training, you'll need to save checkpoints for two main purposes: *sampling* (to test your model) and *resuming training* (to continue from where you left off). The `TrainingClient` provides three methods to handle these cases:
+During training, you'll need to save checkpoints for two main purposes: *sampling* (to test your model) and *resuming training* (to continue from where you left off). The `TrainingClient` provides four methods to handle these cases:
 
 1. `save_weights_for_sampler()`: saves a copy of the model weights that can be used for sampling.
 2. `save_state()`: saves the weights and the optimizer state. You can fully resume training from this checkpoint.
-3. `load_state()`: load the weights and the optimizer state. You can fully resume training from this checkpoint.
+3. `load_state()`: loads the model weights only (without optimizer state). Use this when you want to start fresh training from a checkpoint, e.g., starting DPO training from an SFT checkpoint.
+4. `load_state_with_optimizer()`: loads the model weights and optimizer state. Use this when resuming interrupted training, as it restores the full training state, including optimizer momentum.
 
 Note that (1) is faster and requires less storage space than (2).
```
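To make the behavioral difference between the two loading methods concrete, here is a minimal sketch using a toy stand-in class. This is not the real `TrainingClient` implementation; the class, its fields, and the checkpoint layout are all hypothetical, chosen only to illustrate the semantics described in the diff above.

```python
class ToyTrainingClient:
    """Hypothetical stand-in for TrainingClient; illustrates checkpoint semantics only."""

    def __init__(self):
        self.weights = {"w": 0.0}
        self.optimizer_state = {"momentum": 0.0}

    def save_state(self):
        # Saves both the weights and the optimizer state.
        return {"weights": dict(self.weights),
                "optimizer": dict(self.optimizer_state)}

    def load_state(self, checkpoint):
        # Loads the model weights only; the optimizer state is reset,
        # as when starting fresh training from an existing checkpoint.
        self.weights = dict(checkpoint["weights"])
        self.optimizer_state = {"momentum": 0.0}

    def load_state_with_optimizer(self, checkpoint):
        # Loads weights AND optimizer state, fully resuming training.
        self.weights = dict(checkpoint["weights"])
        self.optimizer_state = dict(checkpoint["optimizer"])


# Train a little, then checkpoint.
client = ToyTrainingClient()
client.weights["w"] = 1.5
client.optimizer_state["momentum"] = 0.9
ckpt = client.save_state()

# Fresh start: weights come back, momentum does not.
fresh = ToyTrainingClient()
fresh.load_state(ckpt)

# Resume: both weights and momentum come back.
resumed = ToyTrainingClient()
resumed.load_state_with_optimizer(ckpt)

print(fresh.optimizer_state["momentum"])    # 0.0
print(resumed.optimizer_state["momentum"])  # 0.9
```

The sketch shows why the commit migrates resume-style callers to `load_state_with_optimizer()`: a caller that relied on `load_state()` restoring optimizer state would silently continue training with reset momentum.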