Fix ESM2 tokenizer export: use PreTrainedTokenizerFast instead of TokenizersBackend by svc-bionemo · Pull Request #1555 · NVIDIA/bionemo-framework

svc-bionemo · 2026-04-18T14:38:48Z

Problem

The ESM2 export script produces a tokenizer_config.json with "tokenizer_class": "TokenizersBackend" when run with transformers 5.x (26.03 container). This class name is not in the AutoTokenizer registry, so loading exported models fails:

ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported.

This affects both the repo exports and the models already on the HF Hub (e.g. nvidia/esm2_t6_8M_UR50D).

Root Cause

Transformers 5.x renamed the internal fast tokenizer backend class to TokenizersBackend, but this name is not registered in TOKENIZER_MAPPING_NAMES. When save_pretrained() is called, it serializes this unresolvable class name.

Fix

After tokenizer.save_pretrained(), patch the saved tokenizer_config.json to:

Set tokenizer_class to "PreTrainedTokenizerFast" (universally resolvable)
Remove non-standard fields (backend, is_local) that transformers 5.x adds

Applied to both bionemo-recipes/models/esm2/export.py and its copy at bionemo-recipes/recipes/vllm_inference/esm2/export.py.
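
The patch step described above could look roughly like the following. This is a minimal sketch, not the code from this PR: the helper name `patch_tokenizer_config` and the `export_dir` argument are hypothetical, and it assumes `backend` and `is_local` are the only fields that need removing.

```python
import json
from pathlib import Path

# Assumption: these are the only non-standard fields to strip, per the Fix list.
NON_STANDARD_FIELDS = ("backend", "is_local")


def patch_tokenizer_config(export_dir: str) -> dict:
    """Rewrite tokenizer_config.json in export_dir (hypothetical helper)."""
    config_path = Path(export_dir) / "tokenizer_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))

    # Replace the unresolvable class name with the generic fast tokenizer class.
    config["tokenizer_class"] = "PreTrainedTokenizerFast"

    # Drop the extra fields; pop() with a default is a no-op when a key is absent.
    for field in NON_STANDARD_FIELDS:
        config.pop(field, None)

    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

It would be called right after `tokenizer.save_pretrained(export_dir)`, with the same directory. All other keys in the config are left untouched.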

Note

The existing Hub models (nvidia/esm2_*) also need their tokenizer_config.json patched. This PR only fixes the export script, so that future exports are correct.

…nedTokenizerFast

In transformers 5.x, AutoTokenizer serializes the class name as "TokenizersBackend" which is not resolvable by AutoTokenizer.from_pretrained(). Patch the saved tokenizer_config.json after save_pretrained() to force tokenizer_class="PreTrainedTokenizerFast" and remove non-standard fields.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: svc-bionemo <267129667+svc-bionemo@users.noreply.github.com>

coderabbitai · 2026-04-18T14:38:55Z

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

svc-bionemo requested review from cspades, dorotat-nv, jomitchellnv, jstjohn, jwilber, pstjohn, savitha-eng and trvachov as code owners April 18, 2026 14:38

svc-bionemo mentioned this pull request Apr 18, 2026

ci: exclude vllm_inference and megatron from nightly recipe CI #1554

Merged

svc-bionemo wants to merge 1 commit into NVIDIA:main from svc-bionemo:svc-bionemo/fix-esm2-tokenizer-export
