Conversation

@andriigrynenko
Contributor

No description provided.

@claude

claude bot commented Dec 2, 2025

Claude finished @andriigrynenko's task


Code Review Complete ✅

Tasks

  • Gather context and read CLAUDE.md
  • Review the git diff to see exact changes
  • Read the full tokenizer_utils.py file
  • Search for references to tokenizer in codebase
  • Analyze the change for correctness and potential issues
  • Provide comprehensive review feedback

Summary

This PR updates the Llama-3 tokenizer mapping from baseten/Meta-Llama-3-tokenizer to Xenova/Meta-Llama-3.1-Tokenizer. The change is in the get_tokenizer() function in tinker_cookbook/tokenizer_utils.py:29.
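
For reference, here is a minimal sketch of what the updated mapping plausibly looks like, combining the snippet quoted in the recommendations below with the caching behavior described in AGENTS.md; the actual body of get_tokenizer in tokenizer_utils.py may differ, and the lru_cache/AutoTokenizer details are assumptions.

    from functools import lru_cache

    from transformers import AutoTokenizer


    @lru_cache
    def get_tokenizer(model_name: str):
        # Remap gated Llama-3 repos to an ungated tokenizer mirror.
        # Previously this pointed at "baseten/Meta-Llama-3-tokenizer".
        if model_name.startswith("meta-llama/Llama-3"):
            model_name = "Xenova/Meta-Llama-3.1-Tokenizer"
        return AutoTokenizer.from_pretrained(model_name)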

Review Feedback

✅ Positives

  1. Addresses gating issue: The change maintains the original goal of bypassing Hugging Face's gating requirements for Llama-3 models
  2. Updated tokenizer version: Moving from Llama-3 to Llama-3.1 tokenizer provides better coverage for newer models in the cookbook (3.1, 3.2, 3.3)
  3. Minimal scope: The change is surgical and only affects tokenizer resolution

⚠️ Concerns & Questions

  1. Version specificity mismatch: The code checks for model_name.startswith("meta-llama/Llama-3") but maps ALL matching models to the Meta-Llama-3.1-Tokenizer. This means:

    • meta-llama/Llama-3.2-1B → uses 3.1 tokenizer ✓ (should be compatible)
    • meta-llama/Llama-3.1-8B → uses 3.1 tokenizer ✓ (correct)
    • Hypothetical meta-llama/Llama-3.0-7B → uses 3.1 tokenizer ⚠️ (could be problematic)
  2. Consistency with documentation: AGENTS.md (line 22) still references baseten/Meta-Llama-3-tokenizer:

    Tokenizers are cached via tokenizer_utils.get_tokenizer, with Llama-3 names remapped to baseten/Meta-Llama-3-tokenizer to bypass HF gating.

    Action needed: This documentation should be updated to reflect the new mapping.

  3. Tokenizer compatibility: Are there any known differences between baseten/Meta-Llama-3-tokenizer and Xenova/Meta-Llama-3.1-Tokenizer (a quick comparison script is sketched after this list)? If they have different vocab sizes or special tokens, this could affect:

    • Existing checkpoints that were trained with the old tokenizer
    • Reproducibility of previous experiments
    • Token alignment in renderers
  4. Missing rationale: The PR has no description explaining:

    • Why the change was necessary
    • Whether baseten/Meta-Llama-3-tokenizer is deprecated or has issues
    • Whether this has been tested with existing recipes
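
One hedged way to settle concern 3 is to load both tokenizers and compare their vocabularies and special tokens directly; this is an illustrative script, not part of the cookbook, and it assumes both Hub repos are publicly reachable.

    from transformers import AutoTokenizer

    old = AutoTokenizer.from_pretrained("baseten/Meta-Llama-3-tokenizer")
    new = AutoTokenizer.from_pretrained("Xenova/Meta-Llama-3.1-Tokenizer")

    # Vocab size and special tokens are the most likely sources of drift.
    print("vocab sizes:", len(old), len(new))
    print("special tokens:", old.all_special_tokens, new.all_special_tokens)

    # Spot-check that a sample string tokenizes identically.
    sample = "Hello, world!"
    print("old ids:", old.encode(sample))
    print("new ids:", new.encode(sample))

If the ID sequences and special-token sets match, existing checkpoints and renderers should be unaffected.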

📝 Recommendations

  1. Add PR description explaining:

    • Reason for the tokenizer change
    • Any compatibility testing performed
    • Impact on existing checkpoints (if any)
  2. Update documentation in AGENTS.md (line 22) and CLAUDE.md (line 22) to reflect the new tokenizer mapping

  3. Consider adding a comment in the code explaining why Xenova's tokenizer is preferred:

    # Avoid gating of Llama 3 models by using Xenova's ungated 3.1 tokenizer
    # which is compatible with all Llama-3.x variants
    if model_name.startswith("meta-llama/Llama-3"):
        model_name = "Xenova/Meta-Llama-3.1-Tokenizer"
  4. Test coverage: Consider running the smoke tests to verify this doesn't break existing functionality (a minimal round-trip check is also sketched after this list):

    pytest tinker_cookbook/tests/smoke_tests.py -v
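
Beyond the smoke tests, a hypothetical round-trip check along these lines could confirm the remapped tokenizer resolves and encodes sanely; it assumes get_tokenizer returns a Hugging Face tokenizer object and is not an existing test in the repo.

    from tinker_cookbook.tokenizer_utils import get_tokenizer

    def test_llama3_tokenizer_remap():
        # Any meta-llama/Llama-3* name should resolve to the ungated tokenizer.
        tok = get_tokenizer("meta-llama/Llama-3.1-8B")
        ids = tok.encode("hello world")
        assert tok.decode(ids, skip_special_tokens=True).strip() == "hello world"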

🔍 Files Requiring Updates

  1. AGENTS.md:22 - Update tokenizer reference
  2. CLAUDE.md:22 - Update tokenizer reference (duplicate content)

Verdict

Conditional approval - The code change itself looks correct and follows the existing pattern. However, the PR needs:

  1. A description explaining the rationale
  2. Documentation updates to match the code change
  3. Confirmation that this has been tested with at least one Llama-3.x model

The functional impact should be minimal since Llama-3.1 tokenizer is designed to be compatible with the Llama-3 family, but the documentation inconsistency should be addressed before merging.


@dot-agi

dot-agi commented Dec 2, 2025

Can this be merged asap?

@andriigrynenko merged commit de1b0ea into main Dec 2, 2025
2 checks passed