
Mixtral recipes #1551

Open
trvachov wants to merge 1 commit into main from trvachov/mixtral-recipes-repush

Conversation

@trvachov
Collaborator

@trvachov trvachov commented Apr 15, 2026

Description

Adds two new self-contained training recipes for MoE (Mixture-of-Experts) models, following the existing Llama3/ESM2 recipe patterns. Out of scope: a complete README with benchmarks; that is in progress for an upcoming PR.

New recipes:

  • bionemo-recipes/recipes/mixtral_native_te/ — TE-accelerated Mixtral FSDP2 training with Lingua-style DCLM Baseline pre-training config (8x1B and 8x7B). Includes DDP and FSDP2 entry points.
  • bionemo-recipes/recipes/opengenome2_mixtral_native_te/ — TE Mixtral for autoregressive DNA on OpenGenome2 metagenomes. Mirrors opengenome2_llama_native_te (THD packing, genomic label masking, validation,
    nucleotide tokenizer packaged with recipe).

Key design decisions:

  • Self-contained KISS: duplicated fused MoE kernels (fused_a2a.py, fused_token_router.py, fused_indices_converter.py), collator, checkpoint, perf logger, etc. across both recipes rather than sharing, matching
    repo convention.
  • Expert parallelism: configurable expert_parallel_size with all-to-all token dispatch; EP=1 default for single-GPU parity with Llama3 recipe.
  • MXFP8 alignment fix (f4013db): pad post-alltoall MoE expert input to multiple of 32 before GroupedLinear, attribute padding to last expert, slice off output. No-op for non-MXFP8/already-aligned. Verified on 8x
    B300 with 8x7B EP=8 @ SEQ=8192.
  • Checkpointing: DCP format (.distcp files) for FSDP2; dedicated distributed checkpointing tests.
  • Review cleanup (136be26): removed 10 hardcoded-path dev configs, aligned Dockerfiles to 26.03 base, expanded train tests (3→7 single-GPU, 1→4 two-GPU), added dataset + checkpoint tests.
  • CI robustness (6ca0891): session-scoped local WordLevel tokenizer fixture to avoid HF Hub dependency.
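The MXFP8 alignment fix above reduces to a small piece of bookkeeping on the per-expert token splits. A minimal sketch of the idea (function name and example numbers are illustrative, not the recipe's actual code):

```python
def pad_m_splits_for_mxfp8(m_splits, align=32):
    """Pad the total post-all-to-all token count up to a multiple of
    `align`, attributing all padding rows to the last expert so that
    sum(m_splits) still matches the (padded) input handed to
    GroupedLinear. The caller slices the padding rows off the output.
    """
    total = sum(m_splits)
    pad_rows = (-total) % align  # 0 when already aligned -> no-op
    padded = list(m_splits)
    padded[-1] += pad_rows
    return padded, pad_rows

# 212 tokens across 4 experts -> 12 padding rows on the last expert (224 = 7*32)
splits, pad = pad_m_splits_for_mxfp8([100, 37, 55, 20])
```

Because the padding is folded into the last expert's split, `m_splits` sums correctly and non-MXFP8 or already-aligned batches pass through unchanged.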

Usage

On Blackwell: MXFP8 training, Mixtral-8x7B

# Note: a plain comment line inside a backslash-continued command terminates
# it early, so the section labels below use backtick command-substitution
# comments, which are valid mid-command.
torchrun --standalone --nproc_per_node=8 train_fsdp2.py \
      --config-name defaults \
      `# --- Model ---` \
      config_name_or_path=./model_configs/mixtral-8x7B \
      +config_kwargs.attn_input_format=thd \
      +config_kwargs.self_attn_mask_type=padding_causal \
      +config_kwargs.max_position_embeddings=8192 \
      `# --- Precision: MXFP8 block scaling ---` \
      fp8_config.enabled=true \
      fp8_config.fp8_recipe=transformer_engine.common.recipe.MXFP8BlockScaling \
      `# --- Parallelism: pure EP=8 across the 8 ranks ---` \
      expert_parallel_size=8 \
      token_dispatcher=alltoall \
      `# --- THD sequence packing (max throughput on variable-length data) ---` \
      use_sequence_packing=true \
      use_meta_device=true \
      `# --- Data ---` \
      dataset.tokenizer_name_or_path=/path/to/tokenizer \
      ~dataset.micro_batch_size \
      +dataset.token_micro_batch_size=16384 \
      dataset.max_seq_length=8192 \
      dataset.stride=64 \
      dataset.pad_sequences_to_be_divisible_by=32 \
      dataset.load_dataset_kwargs.path=parquet \
      +dataset.load_dataset_kwargs.data_files=your_data.parquet \
      `# --- Training loop ---` \
      num_train_steps=500 \
      logger.frequency=5 \
      `# --- WandB ---` \
      wandb.project=mixtral-benchmark-sweep \
      wandb.name=te-mxfp8-thd-max \
      `# --- Checkpointing disabled for benchmarks ---` \
      checkpoint.ckpt_dir=null \
      checkpoint.save_final_model=false
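The `~dataset.micro_batch_size` / `+dataset.token_micro_batch_size=16384` overrides swap a fixed sequence-count batch for a token budget. A hypothetical sketch of the greedy packing idea (not the recipe's actual collator; the pad-to-32 rule mirrors `dataset.pad_sequences_to_be_divisible_by` above):

```python
def pack_by_token_budget(seq_lens, token_budget=16384, pad_multiple=32):
    """Greedily group variable-length sequences into micro-batches whose
    padded token totals stay within `token_budget` (THD-style packing)."""
    batches, current, current_tokens = [], [], 0
    for n in seq_lens:
        padded = -(-n // pad_multiple) * pad_multiple  # ceil to pad_multiple
        if current and current_tokens + padded > token_budget:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(n)
        current_tokens += padded
    if current:
        batches.append(current)
    return batches

# Two full 8192-token sequences fit a 16384-token budget; a third overflows
# into the next micro-batch.
print(pack_by_token_budget([8192, 8192, 8192]))
```

A token budget keeps the per-step compute roughly constant regardless of how sequence lengths are distributed, which is what makes it useful for throughput benchmarking on variable-length data.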

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

  • ciflow:skip - Skip all CI tests for this PR
  • ciflow:notebooks - Run Jupyter notebooks execution tests
  • ciflow:slow - Run slow single GPU integration tests marked as @pytest.mark.slow
  • ciflow:all - Run all tests (unit tests, slow tests, and notebooks). This label can be used to enforce running all framework tests.
  • ciflow:all-recipes - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline.

For more details, see CONTRIBUTING

Note

By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Triggering Code Rabbit AI Review

To trigger a code review from CodeRabbit, comment on the pull request with one of its review commands; see https://docs.coderabbit.ai/reference/review-commands for the full list.

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

@copy-pr-bot

copy-pr-bot Bot commented Apr 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 15, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: fc965d79-2da8-43fb-bddc-dd5cbb72ac68

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.

@trvachov
Collaborator Author

/ok to test 49e8426

@trvachov
Collaborator Author

/ok to test 136be26

@trvachov
Collaborator Author

/ok to test cad830b

@trvachov
Collaborator Author

/ok to test 6ca0891

@trvachov trvachov force-pushed the trvachov/mixtral-recipes-repush branch from 48ea1e1 to 160a754 on April 20, 2026 16:19
@trvachov trvachov changed the title from [Draft] Mixtral recipes to Mixtral recipes on Apr 20, 2026
@pstjohn
Collaborator

pstjohn commented Apr 20, 2026

/ok to test 160a754

Two self-contained recipes following existing Llama3/ESM2 recipe
conventions:

- bionemo-recipes/recipes/mixtral_native_te/: TE-accelerated Mixtral
  FSDP2 training with a Lingua-style DCLM Baseline 1.0 pre-training
  config for Mixtral-8x1B and 8x7B. Includes DDP and FSDP2 entry points.

- bionemo-recipes/recipes/opengenome2_mixtral_native_te/: TE Mixtral for
  autoregressive DNA on OpenGenome2 metagenomes, mirroring
  opengenome2_llama_native_te (THD packing, genomic label masking,
  validation, nucleotide tokenizer packaged with the recipe).

Key design decisions:

- Self-contained KISS: fused MoE kernels (fused_a2a, fused_token_router,
  fused_indices_converter), collator, checkpoint, and perf logger are
  duplicated across both recipes rather than shared, matching repo
  convention.

- Configurable expert parallelism via all-to-all token dispatch;
  expert_parallel_size=1 by default for parity with the Llama3 recipe.

- MXFP8 alignment: pad post-alltoall MoE expert input to a multiple of
  32 before GroupedLinear (attribute padding to the last expert so
  m_splits sums correctly; slice padding off the output). No-op for
  non-MXFP8 and already-aligned batches. Verified on 8x B300 SXM6 with
  Mixtral-8x7B EP=8 at SEQ=8192: FP8 1.196 s/step, MXFP8 1.248 s/step.

- FSDP2 checkpointing uses DCP format (.distcp files), covered by
  dedicated distributed checkpointing tests.

- CI-robust tests: session-scoped local WordLevel tokenizer fixture
  avoids HuggingFace Hub dependency; expanded train coverage (7
  single-GPU, 4 two-GPU tests per recipe) plus dataset and distributed
  checkpoint tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Timur Rvachov <trvachov@nvidia.com>
@trvachov trvachov force-pushed the trvachov/mixtral-recipes-repush branch from 160a754 to 63f58d3 on April 20, 2026 19:20
AsherBond pushed a commit to Distillative-AI/bionemo-framework that referenced this pull request Apr 21, 2026
…VIDIA#1556)

## Problem

The gitleaks pre-commit hook is silently passing in CI, even when
secrets are present. See
NVIDIA#1551, which
includes a hardcoded `WANDB_API_KEY` that gitleaks did not flag.

**Root cause:** The default gitleaks hook entry is:
```
gitleaks git --pre-commit --redact --staged --verbose
```

This scans **staged git changes** — it works during an actual `git
commit`. But in CI, `static_checks.sh` runs:
```
pre-commit run --all-files
```

With `--all-files`, there are no staged files and no commit context, so
gitleaks scans **0 commits** and reports "no leaks found":
```
7:02PM INF 0 commits scanned.
7:02PM INF scanned ~0 bytes (0) in 28.9ms
7:02PM INF no leaks found
```

## Fix

Override the hook entry to use `gitleaks dir --redact --verbose`, which
scans **file contents** directly. This works correctly both:
- Locally during `git commit` (pre-commit hook)
- In CI with `pre-commit run --all-files`
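The override described above would look roughly like this in `.pre-commit-config.yaml` (the `rev` tag is illustrative; pin to the current gitleaks release):

```yaml
- repo: https://github.com/gitleaks/gitleaks
  rev: v8.18.4  # illustrative; check the gitleaks repo for the current tag
  hooks:
    - id: gitleaks
      entry: gitleaks dir --redact --verbose
```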

## Testing

After this change, running `pre-commit run gitleaks --all-files` on the
repo will scan actual file contents instead of scanning 0 commits.

---------

Signed-off-by: svc-bionemo <267129667+svc-bionemo@users.noreply.github.com>
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Co-authored-by: svc-bionemo <267129667+svc-bionemo@users.noreply.github.com>
Co-authored-by: Peter St. John <pstjohn@nvidia.com>