
Commit 72b4857

Merge branch 'main' into use-loss-fn-type
2 parents 62b6b01 + 6e6dbfe

File tree: 21 files changed (+2108 −39 lines)

AGENTS.md

Lines changed: 2 additions & 2 deletions
@@ -19,7 +19,7 @@ Working notes for future agents hacking on `tinker-cookbook`. Additional docs ca
 - Launch scripts define a CLI-facing `CLIConfig` (parsed by `chz`) that instantiates the richer training `Config`. This gives every recipe a consistent `python -m ... key=value` interface.
 - Env builders compose like `RLDatasetBuilder → EnvGroupBuilder → Env`. Groups let us share metadata (tags, pairwise comparisons) and center rewards across related rollouts.
 - **Completers:** algorithms interact with the `TokenCompleter` interface. `TinkerTokenCompleter` (wrapping a `SamplingClient`) is the default implementation, but evaluators may accept any `TokenCompleter` or `MessageCompleter`.
-- **Renderers & tokenizer utils:** pick the renderer that matches your tokenizer/model pair (e.g., `role_colon`, `llama3`, `qwen3`). `TrainOnWhat` controls which tokens get weight=1 in SFT. Tokenizers are cached via `tokenizer_utils.get_tokenizer`, with Llama-3 names remapped to `baseten/Meta-Llama-3-tokenizer` to bypass HF gating.
+- **Renderers & tokenizer utils:** pick the renderer that matches your tokenizer/model pair (e.g., `role_colon`, `llama3`, `qwen3`). `TrainOnWhat` controls which tokens get weight=1 in SFT. Tokenizers are cached via `tokenizer_utils.get_tokenizer`, with Llama-3 names remapped to `thinkingmachineslabinc/meta-llama-3-tokenizer` to bypass HF gating.
 - **Loss plumbing:** every `tinker.Datum` bundles a `model_input` plus `loss_fn_inputs` (`TensorData`). Use helpers such as `conversation_to_datum`, `datum_from_tokens_weights`, and `_remove_mask` instead of constructing dicts manually. Built-in losses: `cross_entropy`, `importance_sampling`, `ppo`; `forward_backward_custom` covers bespoke differentiable objectives.
 
 ## Conventions & Notation (from CONTRIBUTING)

@@ -59,7 +59,7 @@ Working notes for future agents hacking on `tinker-cookbook`. Additional docs ca
 
 ### Evaluations & Sampling
 - Inline evaluators implement either `TrainingClientEvaluator` or `SamplingClientEvaluator`. Training loops accept builder lists (`evaluator_builders`, `infrequent_evaluator_builders`). Inspect AI integration is in `eval/inspect_evaluators.py` and `eval/run_inspect_evals.py`.
-- Sampling clients come from `training_client.save_weights_and_get_sampling_client(name=...)`. To export weights, use `RestClient.download_checkpoint_archive_from_tinker_path`.
+- Sampling clients come from `training_client.save_weights_and_get_sampling_client(name=...)`. To export weights, use `RestClient.get_checkpoint_archive_url_from_tinker_path`.
 
 ## Async & Performance
 - Worker pools advance in ~10s clock cycles. Submit `forward_backward_async` and `optim_step_async` back-to-back, then await both futures to keep them on the same cycle.
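The async note in that last bullet describes a specific submission pattern; here is a minimal sketch of it. The method names come from the repo's docs and this commit, while the surrounding function, the batch contents, and the `loss_fn` value are illustrative assumptions:

```python
# Sketch: submit forward/backward and the optimizer step back-to-back,
# then await both results, so both requests land on the same ~10s worker cycle.
# `batch` is assumed to be a list of tinker.Datum; "cross_entropy" is one of
# the built-in losses named above.
async def train_step(training_client, batch, adam_params):
    fb_future = await training_client.forward_backward_async(batch, loss_fn="cross_entropy")
    optim_future = await training_client.optim_step_async(adam_params)
    fb_result = await fb_future.result_async()
    _ = await optim_future.result_async()
    return fb_result
```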

README.md

Lines changed: 1 addition & 1 deletion
@@ -40,7 +40,7 @@ See [tinker_cookbook/recipes/sl_loop.py](tinker_cookbook/recipes/sl_loop.py) and
 To download the weights of any model:
 ```python
 rest_client = service_client.create_rest_client()
-future = rest_client.download_checkpoint_archive_from_tinker_path(sampling_client.model_path)
+future = rest_client.get_checkpoint_archive_url_from_tinker_path(sampling_client.model_path)
 with open(f"model-checkpoint.tar.gz", "wb") as f:
     f.write(future.result())
 ```
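The snippet assumes a `service_client` and `sampling_client` already in scope; a plausible setup (the model name is a placeholder assumption, not part of this commit) looks like:

```python
import tinker

service_client = tinker.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="meta-llama/Llama-3.1-8B"  # placeholder model name
)
sampling_client = training_client.save_weights_and_get_sampling_client(name="final")
```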

llms-full.txt

Lines changed: 44 additions & 9 deletions
@@ -607,11 +607,12 @@ We'll start with a couple of general pages that'll be relevant to almost all of
 
 # Saving and loading weights and optimizer state
 
-During training, you'll need to save checkpoints for two main purposes: *sampling* (to test your model) and *resuming training* (to continue from where you left off). The `TrainingClient` provides three methods to handle these cases:
+During training, you'll need to save checkpoints for two main purposes: *sampling* (to test your model) and *resuming training* (to continue from where you left off). The `TrainingClient` provides these methods to handle these cases:
 
 1. `save_weights_for_sampler()`: saves a copy of the model weights that can be used for sampling.
 2. `save_state()`: saves the weights and the optimizer state. You can fully resume training from this checkpoint.
-3. `load_state()`: load the weights and the optimizer state. You can fully resume training from this checkpoint.
+3. `load_state()`: load the model weights only (without optimizer state). Use this when you want to start fresh training from a checkpoint, e.g., starting DPO training from an SFT checkpoint.
+4. `load_state_with_optimizer()`: load the model weights and optimizer state. Use this when resuming interrupted training, as it restores the full training state including optimizer momentum.
 
 Note that (1) is faster and requires less storage space than (2).
 
@@ -644,24 +645,58 @@ sampling_client = training_client.save_weights_and_get_sampling_client(name="000
 
 ### Example: Saving to resume training
 
-Use `save_state()` and `load_state()` when you need to pause and continue training with full optimizer state preferred:
+Use `save_state()` and `load_state_with_optimizer()` when you need to pause and continue training with full optimizer state:
 
 ```python
 # Save a checkpoint that you can resume from
 resume_path = training_client.save_state(name="0010").result().path
 
-# Load that checkpoint
-training_client.load_state(resume_path)
+# Load that checkpoint with optimizer state (for resuming training)
+training_client.load_state_with_optimizer(resume_path)
 ```
 
-### When to use `save_state()` and `load_state()`:
+Async versions are also available: `load_state_with_optimizer_async()`.
 
+### Example: Starting fresh from a checkpoint
 
-- Multi-step training pipelines (e.g. supervised learning followed by reinforcement learning)
-- Adjusting hyperparameters or data mid-run
-- Recovery from interruptions or failures
+Use `load_state()` when you want to start a new training phase from saved weights (e.g., starting DPO from an SFT checkpoint):
+
+```python
+# Load weights only, starting with fresh optimizer state
+training_client.load_state(sft_checkpoint_path)
+```
+
+### When to use `load_state_with_optimizer()`:
+
+- Recovery from interruptions or failures (resume training exactly where you left off)
 - Any scenario where you need to preserve exact optimizer state (momentum, learning rate schedules, etc.)
 
+### When to use `load_state()`:
+
+- Multi-step training pipelines (e.g., starting DPO training from an SFT checkpoint)
+- Starting fresh training from pretrained weights with a new optimizer
+
+### ServiceClient methods for loading checkpoints
+
+The `ServiceClient` also provides methods to create a new `TrainingClient` directly from a saved checkpoint:
+
+- `create_training_client_from_state(path)`: Creates a `TrainingClient` with weights loaded from the checkpoint (no optimizer state). Use this when starting a new training phase from saved weights.
+- `create_training_client_from_state_with_optimizer(path)`: Creates a `TrainingClient` with both weights and optimizer state loaded. Use this when resuming interrupted training.
+
+```python
+# Resume training with optimizer state
+training_client = service_client.create_training_client_from_state_with_optimizer(
+    "tinker://run-id/weights/checkpoint-001"
+)
+
+# Start fresh training from a checkpoint (weights only)
+training_client = service_client.create_training_client_from_state(
+    "tinker://run-id/weights/checkpoint-001"
+)
+```
+
+Async versions are also available: `create_training_client_from_state_async()` and `create_training_client_from_state_with_optimizer_async()`.
 
 ---

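The docs and code in this commit converge on a single dispatch rule: resume interrupted runs with optimizer state, warm-start new phases from weights only. A condensed sketch of that rule (the helper itself is hypothetical; every client call in it appears in the diffs below):

```python
def make_training_client(service_client, resume_info, load_checkpoint_path, model_name, lora_rank):
    # Hypothetical helper condensing the pattern used in the recipe changes below.
    if resume_info:
        # Resuming an interrupted run: restore weights AND optimizer state.
        return service_client.create_training_client_from_state_with_optimizer(
            resume_info["state_path"]
        )
    if load_checkpoint_path:
        # Warm-starting a new phase (e.g., DPO from SFT): weights only, fresh optimizer.
        return service_client.create_training_client_from_state(load_checkpoint_path)
    # Otherwise start from the base model with a fresh LoRA.
    return service_client.create_lora_training_client(base_model=model_name, rank=lora_rank)
```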
pyproject.toml

Lines changed: 2 additions & 1 deletion
@@ -9,11 +9,12 @@ authors = [
 requires-python = ">=3.11"
 dependencies = [
     "chz",
+    "cloudpickle",
     "datasets",
     "numpy",
     "rich",
     "termcolor",
-    "tinker>=0.5.1",
+    "tinker>=0.6.1",
     "torch",
     "transformers",
     "blobfile",

tinker_cookbook/distillation/train_on_policy.py

Lines changed: 1 addition & 1 deletion
@@ -386,7 +386,7 @@ async def main(
         resume_info["state_path"] if resume_info else cfg.load_checkpoint_path
     )
     if load_state_path:
-        future = await training_client.load_state_async(load_state_path)
+        future = await training_client.load_state_with_optimizer_async(load_state_path)
         _ = await future.result_async()
         logger.info(f"Loaded state from {load_state_path}")

tinker_cookbook/preference/train_dpo.py

Lines changed: 9 additions & 8 deletions
@@ -91,14 +91,15 @@ def create_dpo_clients(
         base_model=config.model_name, rank=config.lora_rank
     )
 
-    # Load state first to get the SFT checkpoint path for the reference client
-    load_state_path: str | None = (
-        resume_info["state_path"] if resume_info else config.load_checkpoint_path
-    )
-    if load_state_path:
-        # Load state into the training client
-        training_client.load_state(load_state_path).result()
-        logger.info(f"Loaded weights from {load_state_path}")
+    # Load state - differentiate between resuming DPO training vs starting fresh from SFT
+    if resume_info:
+        # Resuming interrupted DPO training - load optimizer state for proper continuation
+        training_client.load_state_with_optimizer(resume_info["state_path"]).result()
+        logger.info(f"Resumed DPO training from {resume_info['state_path']}")
+    elif config.load_checkpoint_path:
+        # Starting fresh DPO from SFT checkpoint - load weights only (fresh optimizer)
+        training_client.load_state(config.load_checkpoint_path).result()
+        logger.info(f"Loaded weights from {config.load_checkpoint_path}")
     # Create a sampling client for the reference model from the training client
     reference_client = training_client.save_weights_and_get_sampling_client("reference")
     return training_client, reference_client
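Note the ordering preserved by this hunk: the reference sampling client is snapshotted only after the checkpoint weights are loaded, so the frozen reference policy that DPO compares against matches the run's starting weights rather than a fresh LoRA initialization.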

tinker_cookbook/recipes/rl_loop.py

Lines changed: 1 addition & 1 deletion
@@ -83,7 +83,7 @@ def main(config: Config):
 
     resume_info = checkpoint_utils.get_last_checkpoint(config.log_path)
     if resume_info:
-        training_client = service_client.create_training_client_from_state(
+        training_client = service_client.create_training_client_from_state_with_optimizer(
             resume_info["state_path"]
         )
         start_batch = resume_info["batch"]

tinker_cookbook/recipes/sl_loop.py

Lines changed: 1 addition & 1 deletion
@@ -63,7 +63,7 @@ def main(config: Config):
     # Check for resuming
     resume_info = checkpoint_utils.get_last_checkpoint(config.log_path)
    if resume_info:
-        training_client = service_client.create_training_client_from_state(
+        training_client = service_client.create_training_client_from_state_with_optimizer(
            resume_info["state_path"]
        )
        start_batch = resume_info["batch"]

tinker_cookbook/renderers.py

Lines changed: 7 additions & 2 deletions
@@ -625,13 +625,18 @@ def _render_message(self, idx: int, message: Message) -> tuple[list[int], list[i
 class DeepSeekV3Renderer(Renderer):
     """
     Format like this (no newlines between messages):
-    <|begin_of_sentence|><|User|>What can you help me with?<|Assistant|><think>Thinking...</think>I can help you with...<|end_of_centence|>
+    <|begin_of_sentence|><|User|>What can you help me with?<|Assistant|><think>Thinking...</think>I can help you with...<|end_of_sentence|>
     For no-think, just use <|Assistant|></think>
+    Deepseek renderer does not support the system role out of the box. You can set system_role_as_user to True to automatically convert the system role to the user role.
     """
 
+    def __init__(self, tokenizer: Tokenizer, system_role_as_user: bool = False):
+        super().__init__(tokenizer)
+        self.system_role_as_user = system_role_as_user
+
     def _render_message(self, message: Message) -> tuple[list[int], list[int], list[int]]:
         assert message.get("thinking") is None, "TODO: support CoT in DsV3 renderer"
-        if message["role"] == "user":
+        if message["role"] == "user" or (self.system_role_as_user and message["role"] == "system"):
             role_token = self._get_special_token("User")
         elif message["role"] == "assistant":
             role_token = self._get_special_token("Assistant")
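A quick usage sketch of the new flag; the import paths and tokenizer name are assumptions for illustration, not part of the diff:

```python
from tinker_cookbook import renderers
from tinker_cookbook.tokenizer_utils import get_tokenizer  # assumed import path

tokenizer = get_tokenizer("deepseek-ai/DeepSeek-V3")  # assumed tokenizer name
renderer = renderers.DeepSeekV3Renderer(tokenizer, system_role_as_user=True)
# A system message is now rendered with the <|User|> role token instead of
# falling through the role dispatch.
```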

tinker_cookbook/rl/train.py

Lines changed: 12 additions & 6 deletions
@@ -1058,14 +1058,20 @@ async def main(
     start_batch = 0
 
     service_client = tinker.ServiceClient(base_url=cfg.base_url)
-    load_state_path: str | None = (
-        resume_info["state_path"] if resume_info else cfg.load_checkpoint_path
-    )
-    if load_state_path:
+    if resume_info:
+        # Resuming interrupted training - load optimizer state for proper continuation
+        training_client = (
+            await service_client.create_training_client_from_state_with_optimizer_async(
+                resume_info["state_path"]
+            )
+        )
+        logger.info(f"Resumed training from {resume_info['state_path']}")
+    elif cfg.load_checkpoint_path:
+        # Starting fresh from a checkpoint - load weights only (fresh optimizer)
         training_client = await service_client.create_training_client_from_state_async(
-            load_state_path
+            cfg.load_checkpoint_path
         )
-        logger.info(f"Loaded state from {load_state_path}")
+        logger.info(f"Loaded weights from {cfg.load_checkpoint_path}")
     else:
         training_client = await service_client.create_lora_training_client_async(
             cfg.model_name, rank=cfg.lora_rank
