Conversation

@andriigrynenko
Contributor

No description provided.

@claude

claude bot commented Dec 2, 2025

Claude finished @andriigrynenko's task


Code Review Complete ✅

Tasks

  • Gather context and read CLAUDE.md
  • Review the git diff to see exact changes
  • Read the full tokenizer_utils.py file
  • Search for references to tokenizer in codebase
  • Analyze the change for correctness and potential issues
  • Provide comprehensive review feedback

Summary

This PR updates the Llama-3 tokenizer mapping from baseten/Meta-Llama-3-tokenizer to Xenova/Meta-Llama-3.1-Tokenizer. The change is in the get_tokenizer() function in tinker_cookbook/tokenizer_utils.py:29.
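
For reference, here is a minimal sketch of what the updated mapping plausibly looks like, combining the snippet quoted in the recommendations below with the caching behavior described in AGENTS.md; the actual body of get_tokenizer in tokenizer_utils.py may differ, and the lru_cache/AutoTokenizer details are assumptions.

    from functools import lru_cache

    from transformers import AutoTokenizer


    @lru_cache
    def get_tokenizer(model_name: str):
        # Remap gated Llama-3 repos to an ungated tokenizer mirror.
        # Previously this pointed at "baseten/Meta-Llama-3-tokenizer".
        if model_name.startswith("meta-llama/Llama-3"):
            model_name = "Xenova/Meta-Llama-3.1-Tokenizer"
        return AutoTokenizer.from_pretrained(model_name)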

Review Feedback

✅ Positives

  1. Addresses gating issue: The change maintains the original goal of bypassing Hugging Face's gating requirements for Llama-3 models
  2. Updated tokenizer version: Moving from Llama-3 to Llama-3.1 tokenizer provides better coverage for newer models in the cookbook (3.1, 3.2, 3.3)
  3. Minimal scope: The change is surgical and only affects tokenizer resolution

⚠️ Concerns & Questions

  1. Version specificity mismatch: The code checks for model_name.startswith("meta-llama/Llama-3") but maps ALL matching models to the Meta-Llama-3.1-Tokenizer. This means:

    • meta-llama/Llama-3.2-1B → uses 3.1 tokenizer ✓ (should be compatible)
    • meta-llama/Llama-3.1-8B → uses 3.1 tokenizer ✓ (correct)
    • Hypothetical meta-llama/Llama-3.0-7B → uses 3.1 tokenizer ⚠️ (could be problematic)
  2. Consistency with documentation: AGENTS.md (line 22) still references baseten/Meta-Llama-3-tokenizer:

    Tokenizers are cached via tokenizer_utils.get_tokenizer, with Llama-3 names remapped to baseten/Meta-Llama-3-tokenizer to bypass HF gating.

    Action needed: This documentation should be updated to reflect the new mapping.

  3. Tokenizer compatibility: Are there any known differences between baseten/Meta-Llama-3-tokenizer and Xenova/Meta-Llama-3.1-Tokenizer (a quick comparison script is sketched after this list)? If they have different vocab sizes or special tokens, this could affect:

    • Existing checkpoints that were trained with the old tokenizer
    • Reproducibility of previous experiments
    • Token alignment in renderers
  4. Missing rationale: The PR has no description explaining:

    • Why the change was necessary
    • Whether baseten/Meta-Llama-3-tokenizer is deprecated or has issues
    • Whether this has been tested with existing recipes
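
One hedged way to settle concern 3 is to load both tokenizers and compare their vocabularies and special tokens directly; this is an illustrative script, not part of the cookbook, and it assumes both Hub repos are publicly reachable.

    from transformers import AutoTokenizer

    old = AutoTokenizer.from_pretrained("baseten/Meta-Llama-3-tokenizer")
    new = AutoTokenizer.from_pretrained("Xenova/Meta-Llama-3.1-Tokenizer")

    # Vocab size and special tokens are the most likely sources of drift.
    print("vocab sizes:", len(old), len(new))
    print("special tokens:", old.all_special_tokens, new.all_special_tokens)

    # Spot-check that a sample string tokenizes identically.
    sample = "Hello, world!"
    print("old ids:", old.encode(sample))
    print("new ids:", new.encode(sample))

If the ID sequences and special-token sets match, existing checkpoints and renderers should be unaffected.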

📝 Recommendations

  1. Add PR description explaining:

    • Reason for the tokenizer change
    • Any compatibility testing performed
    • Impact on existing checkpoints (if any)
  2. Update documentation in AGENTS.md (line 22) and CLAUDE.md (line 22) to reflect the new tokenizer mapping

  3. Consider adding a comment in the code explaining why Xenova's tokenizer is preferred:

    # Avoid gating of Llama 3 models by using Xenova's ungated 3.1 tokenizer
    # which is compatible with all Llama-3.x variants
    if model_name.startswith("meta-llama/Llama-3"):
        model_name = "Xenova/Meta-Llama-3.1-Tokenizer"
  4. Test coverage: Consider running the smoke tests to verify this doesn't break existing functionality (a minimal round-trip check is also sketched after this list):

    pytest tinker_cookbook/tests/smoke_tests.py -v
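
Beyond the smoke tests, a hypothetical round-trip check along these lines could confirm the remapped tokenizer resolves and encodes sanely; it assumes get_tokenizer returns a Hugging Face tokenizer object and is not an existing test in the repo.

    from tinker_cookbook.tokenizer_utils import get_tokenizer

    def test_llama3_tokenizer_remap():
        # Any meta-llama/Llama-3* name should resolve to the ungated tokenizer.
        tok = get_tokenizer("meta-llama/Llama-3.1-8B")
        ids = tok.encode("hello world")
        assert tok.decode(ids, skip_special_tokens=True).strip() == "hello world"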

🔍 Files Requiring Updates

  1. AGENTS.md:22 - Update tokenizer reference
  2. CLAUDE.md:22 - Update tokenizer reference (duplicate content)

Verdict

Conditional approval - The code change itself looks correct and follows the existing pattern. However, the PR needs:

  1. A description explaining the rationale
  2. Documentation updates to match the code change
  3. Confirmation that this has been tested with at least one Llama-3.x model

The functional impact should be minimal since Llama-3.1 tokenizer is designed to be compatible with the Llama-3 family, but the documentation inconsistency should be addressed before merging.


@dot-agi

dot-agi commented Dec 2, 2025

Can this be merged asap?

@andriigrynenko merged commit de1b0ea into main Dec 2, 2025
2 checks passed