Commit de1b0ea

Update tokenizer for llama3 (#144)
Co-authored-by: Sergei <[email protected]>
1 parent 1ce9a0f commit de1b0ea

2 files changed: 2 additions & 2 deletions

AGENTS.md

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ Working notes for future agents hacking on `tinker-cookbook`. Additional docs ca
 - Launch scripts define a CLI-facing `CLIConfig` (parsed by `chz`) that instantiates the richer training `Config`. This gives every recipe a consistent `python -m ... key=value` interface.
 - Env builders compose like `RLDatasetBuilder → EnvGroupBuilder → Env`. Groups let us share metadata (tags, pairwise comparisons) and center rewards across related rollouts.
 - **Completers:** algorithms interact with the `TokenCompleter` interface. `TinkerTokenCompleter` (wrapping a `SamplingClient`) is the default implementation, but evaluators may accept any `TokenCompleter` or `MessageCompleter`.
-- **Renderers & tokenizer utils:** pick the renderer that matches your tokenizer/model pair (e.g., `role_colon`, `llama3`, `qwen3`). `TrainOnWhat` controls which tokens get weight=1 in SFT. Tokenizers are cached via `tokenizer_utils.get_tokenizer`, with Llama-3 names remapped to `baseten/Meta-Llama-3-tokenizer` to bypass HF gating.
+- **Renderers & tokenizer utils:** pick the renderer that matches your tokenizer/model pair (e.g., `role_colon`, `llama3`, `qwen3`). `TrainOnWhat` controls which tokens get weight=1 in SFT. Tokenizers are cached via `tokenizer_utils.get_tokenizer`, with Llama-3 names remapped to `thinkingmachineslabinc/meta-llama-3-tokenizer` to bypass HF gating.
 - **Loss plumbing:** every `tinker.Datum` bundles a `model_input` plus `loss_fn_inputs` (`TensorData`). Use helpers such as `conversation_to_datum`, `datum_from_tokens_weights`, and `_remove_mask` instead of constructing dicts manually. Built-in losses: `cross_entropy`, `importance_sampling`, `ppo`; `forward_backward_custom` covers bespoke differentiable objectives.
 
 ## Conventions & Notation (from CONTRIBUTING)
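
As background for the `TrainOnWhat` / weight=1 note in the bullet above, here is a minimal, self-contained sketch of per-token loss weighting in SFT. The helper name and role labels are illustrative assumptions, not the cookbook's actual API:

def build_weights(token_roles: list[str], train_on: str = "assistant") -> list[float]:
    # Weight 1.0 for tokens produced by the role we train on, 0.0 for everything else.
    return [1.0 if role == train_on else 0.0 for role in token_roles]

# Example: a short conversation flattened to tokens tagged with their originating role.
roles = ["user", "user", "user", "assistant", "assistant"]
print(build_weights(roles))  # [0.0, 0.0, 0.0, 1.0, 1.0]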

tinker_cookbook/tokenizer_utils.py

Lines changed: 1 addition & 1 deletion
@@ -26,6 +26,6 @@ def get_tokenizer(model_name: str) -> Tokenizer:
 
     # Avoid gating of Llama 3 models:
     if model_name.startswith("meta-llama/Llama-3"):
-        model_name = "baseten/Meta-Llama-3-tokenizer"
+        model_name = "thinkingmachineslabinc/meta-llama-3-tokenizer"
 
     return AutoTokenizer.from_pretrained(model_name, use_fast=True)
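
For context, a plausible reconstruction of the full helper after this change. Only the lines in the hunk above come from the actual file; the caching decorator, import layout, and `Tokenizer` alias are assumptions based on the "tokenizers are cached via `tokenizer_utils.get_tokenizer`" note in AGENTS.md:

from functools import lru_cache

from transformers import AutoTokenizer
from transformers import PreTrainedTokenizerBase as Tokenizer


@lru_cache(maxsize=None)  # assumed caching; AGENTS.md only says tokenizers are cached
def get_tokenizer(model_name: str) -> Tokenizer:
    # Avoid gating of Llama 3 models:
    if model_name.startswith("meta-llama/Llama-3"):
        model_name = "thinkingmachineslabinc/meta-llama-3-tokenizer"

    return AutoTokenizer.from_pretrained(model_name, use_fast=True)

With this in place, a call such as get_tokenizer("meta-llama/Llama-3.1-8B-Instruct") is transparently redirected to the ungated mirror, so no Hugging Face access token is needed.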
