Fix ESM2 tokenizer export: use PreTrainedTokenizerFast instead of TokenizersBackend#1555

Open
svc-bionemo wants to merge 1 commit into NVIDIA:main from svc-bionemo:svc-bionemo/fix-esm2-tokenizer-export

Conversation

@svc-bionemo
Collaborator

Problem

The ESM2 export script produces a tokenizer_config.json with "tokenizer_class": "TokenizersBackend" when run with transformers 5.x (26.03 container). This class name is not in the AutoTokenizer registry, so loading exported models fails:

ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported.

This affects both the repo exports and the models already on the HF Hub (e.g. nvidia/esm2_t6_8M_UR50D).

Root Cause

Transformers 5.x renamed the internal fast tokenizer backend class to TokenizersBackend, but this name is not registered in TOKENIZER_MAPPING_NAMES. When save_pretrained() is called, it serializes this unresolvable class name.

Fix

After tokenizer.save_pretrained(), patch the saved tokenizer_config.json to:

  1. Set tokenizer_class to "PreTrainedTokenizerFast" (universally resolvable)
  2. Remove non-standard fields (backend, is_local) that transformers 5.x adds

Applied to both bionemo-recipes/models/esm2/export.py and its copy at bionemo-recipes/recipes/vllm_inference/esm2/export.py.
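The post-processing step described above can be sketched as follows. This is a minimal illustration of the approach, not the exact code in `export.py`; the helper name `patch_tokenizer_config` is hypothetical, while the field names (`tokenizer_class`, `backend`, `is_local`) match those described in this PR.

```python
import json
from pathlib import Path


def patch_tokenizer_config(save_dir: str) -> None:
    """Patch tokenizer_config.json (written by save_pretrained) so the
    exported model loads with AutoTokenizer under any transformers version.

    Hypothetical sketch of the fix described in this PR.
    """
    config_path = Path(save_dir) / "tokenizer_config.json"
    config = json.loads(config_path.read_text())

    # 1. Replace the unresolvable "TokenizersBackend" class name with the
    #    universally resolvable fast-tokenizer base class.
    config["tokenizer_class"] = "PreTrainedTokenizerFast"

    # 2. Drop the non-standard fields that transformers 5.x adds.
    for key in ("backend", "is_local"):
        config.pop(key, None)

    config_path.write_text(json.dumps(config, indent=2) + "\n")
```

Called right after `tokenizer.save_pretrained(save_dir)`, this leaves the rest of the config (e.g. `model_max_length`, special-token entries) untouched.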

Note

The existing Hub models (nvidia/esm2_*) also need their tokenizer_config.json patched separately; this PR fixes the export script so that future exports are correct.

…nedTokenizerFast

In transformers 5.x, AutoTokenizer serializes the class name as
"TokenizersBackend" which is not resolvable by AutoTokenizer.from_pretrained().
Patch the saved tokenizer_config.json after save_pretrained() to force
tokenizer_class="PreTrainedTokenizerFast" and remove non-standard fields.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: svc-bionemo <267129667+svc-bionemo@users.noreply.github.com>

coderabbitai Bot commented Apr 18, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


