
Mixtral recipes #1551

Open
trvachov wants to merge 1 commit into main from trvachov/mixtral-recipes-repush

Conversation

@trvachov
Collaborator

@trvachov trvachov commented Apr 15, 2026

Description

Adds two new self-contained training recipes for MoE (Mixture-of-Experts) models, following the existing Llama3/ESM2 recipe patterns. Out of scope: a complete README with benchmarks; that is in progress for an upcoming PR.

New recipes:

  • bionemo-recipes/recipes/mixtral_native_te/ — TE-accelerated Mixtral FSDP2 training with Lingua-style DCLM Baseline pre-training config (8x1B and 8x7B). Includes DDP and FSDP2 entry points.
  • bionemo-recipes/recipes/opengenome2_mixtral_native_te/ — TE Mixtral for autoregressive DNA on OpenGenome2 metagenomes. Mirrors opengenome2_llama_native_te (THD packing, genomic label masking, validation,
    nucleotide tokenizer packaged with recipe).

Key design decisions:

  • Self-contained KISS: duplicated fused MoE kernels (fused_a2a.py, fused_token_router.py, fused_indices_converter.py), collator, checkpoint, perf logger, etc. across both recipes rather than sharing, matching
    repo convention.
  • Expert parallelism: configurable expert_parallel_size with all-to-all token dispatch; EP=1 default for single-GPU parity with Llama3 recipe.
  • MXFP8 alignment fix (f4013db): pad post-alltoall MoE expert input to multiple of 32 before GroupedLinear, attribute padding to last expert, slice off output. No-op for non-MXFP8/already-aligned. Verified on 8x
    B300 with 8x7B EP=8 @ SEQ=8192.
  • Checkpointing: DCP format (.distcp files) for FSDP2; dedicated distributed checkpointing tests.
  • Review cleanup (136be26): removed 10 hardcoded-path dev configs, aligned Dockerfiles to 26.03 base, expanded train tests (3→7 single-GPU, 1→4 two-GPU), added dataset + checkpoint tests.
  • CI robustness (6ca0891): session-scoped local WordLevel tokenizer fixture to avoid HF Hub dependency.
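The MXFP8 alignment fix above reduces to a small piece of bookkeeping on the per-expert token splits. A minimal sketch of the idea (function name and example numbers are illustrative, not the recipe's actual code):

```python
def pad_m_splits_for_mxfp8(m_splits, align=32):
    """Pad the total post-all-to-all token count up to a multiple of
    `align`, attributing all padding rows to the last expert so that
    sum(m_splits) still matches the (padded) input handed to
    GroupedLinear. The caller slices the padding rows off the output.
    """
    total = sum(m_splits)
    pad_rows = (-total) % align  # 0 when already aligned -> no-op
    padded = list(m_splits)
    padded[-1] += pad_rows
    return padded, pad_rows

# 212 tokens across 4 experts -> 12 padding rows on the last expert (224 = 7*32)
splits, pad = pad_m_splits_for_mxfp8([100, 37, 55, 20])
```

Because the padding is folded into the last expert's split, `m_splits` sums correctly and non-MXFP8 or already-aligned batches pass through unchanged.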

Usage

On Blackwell: MXFP8 training, Mixtral-8x7B

# Note: a plain comment line inside a backslash-continued command terminates
# it early, so the section labels below use backtick command-substitution
# comments, which are valid mid-command.
torchrun --standalone --nproc_per_node=8 train_fsdp2.py \
      --config-name defaults \
      `# --- Model ---` \
      config_name_or_path=./model_configs/mixtral-8x7B \
      +config_kwargs.attn_input_format=thd \
      +config_kwargs.self_attn_mask_type=padding_causal \
      +config_kwargs.max_position_embeddings=8192 \
      `# --- Precision: MXFP8 block scaling ---` \
      fp8_config.enabled=true \
      fp8_config.fp8_recipe=transformer_engine.common.recipe.MXFP8BlockScaling \
      `# --- Parallelism: pure EP=8 across the 8 ranks ---` \
      expert_parallel_size=8 \
      token_dispatcher=alltoall \
      `# --- THD sequence packing (max throughput on variable-length data) ---` \
      use_sequence_packing=true \
      use_meta_device=true \
      `# --- Data ---` \
      dataset.tokenizer_name_or_path=/path/to/tokenizer \
      ~dataset.micro_batch_size \
      +dataset.token_micro_batch_size=16384 \
      dataset.max_seq_length=8192 \
      dataset.stride=64 \
      dataset.pad_sequences_to_be_divisible_by=32 \
      dataset.load_dataset_kwargs.path=parquet \
      +dataset.load_dataset_kwargs.data_files=your_data.parquet \
      `# --- Training loop ---` \
      num_train_steps=500 \
      logger.frequency=5 \
      `# --- WandB ---` \
      wandb.project=mixtral-benchmark-sweep \
      wandb.name=te-mxfp8-thd-max \
      `# --- Checkpointing disabled for benchmarks ---` \
      checkpoint.ckpt_dir=null \
      checkpoint.save_final_model=false
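The `~dataset.micro_batch_size` / `+dataset.token_micro_batch_size=16384` overrides swap a fixed sequence-count batch for a token budget. A hypothetical sketch of the greedy packing idea (not the recipe's actual collator; the pad-to-32 rule mirrors `dataset.pad_sequences_to_be_divisible_by` above):

```python
def pack_by_token_budget(seq_lens, token_budget=16384, pad_multiple=32):
    """Greedily group variable-length sequences into micro-batches whose
    padded token totals stay within `token_budget` (THD-style packing)."""
    batches, current, current_tokens = [], [], 0
    for n in seq_lens:
        padded = -(-n // pad_multiple) * pad_multiple  # ceil to pad_multiple
        if current and current_tokens + padded > token_budget:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(n)
        current_tokens += padded
    if current:
        batches.append(current)
    return batches

# Two full 8192-token sequences fit a 16384-token budget; a third overflows
# into the next micro-batch.
print(pack_by_token_budget([8192, 8192, 8192]))
```

A token budget keeps the per-step compute roughly constant regardless of how sequence lengths are distributed, which is what makes it useful for throughput benchmarking on variable-length data.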

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

  • ciflow:skip - Skip all CI tests for this PR
  • ciflow:notebooks - Run Jupyter notebooks execution tests
  • ciflow:slow - Run slow single GPU integration tests marked as @pytest.mark.slow
  • ciflow:all - Run all tests (unit tests, slow tests, and notebooks). This label can be used to enforce running all framework tests.
  • ciflow:all-recipes - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline.

For more details, see CONTRIBUTING

Note

By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Triggering Code Rabbit AI Review

To trigger a code review from CodeRabbit, comment on the pull request with one of its review commands; see https://docs.coderabbit.ai/reference/review-commands for the full list.

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

@copy-pr-bot

copy-pr-bot Bot commented Apr 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 15, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: fc965d79-2da8-43fb-bddc-dd5cbb72ac68

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.

@trvachov
Collaborator Author

/ok to test 49e8426

@trvachov
Collaborator Author

/ok to test 136be26

@trvachov
Collaborator Author

/ok to test cad830b

@trvachov
Collaborator Author

/ok to test 6ca0891

@trvachov trvachov force-pushed the trvachov/mixtral-recipes-repush branch from 48ea1e1 to 160a754 on April 20, 2026 16:19
@trvachov trvachov changed the title from [Draft] Mixtral recipes to Mixtral recipes on Apr 20, 2026
@pstjohn
Collaborator

pstjohn commented Apr 20, 2026

/ok to test 160a754

Two self-contained recipes following existing Llama3/ESM2 recipe
conventions:

- bionemo-recipes/recipes/mixtral_native_te/: TE-accelerated Mixtral
  FSDP2 training with a Lingua-style DCLM Baseline 1.0 pre-training
  config for Mixtral-8x1B and 8x7B. Includes DDP and FSDP2 entry points.

- bionemo-recipes/recipes/opengenome2_mixtral_native_te/: TE Mixtral for
  autoregressive DNA on OpenGenome2 metagenomes, mirroring
  opengenome2_llama_native_te (THD packing, genomic label masking,
  validation, nucleotide tokenizer packaged with the recipe).

Key design decisions:

- Self-contained KISS: fused MoE kernels (fused_a2a, fused_token_router,
  fused_indices_converter), collator, checkpoint, and perf logger are
  duplicated across both recipes rather than shared, matching repo
  convention.

- Configurable expert parallelism via all-to-all token dispatch;
  expert_parallel_size=1 by default for parity with the Llama3 recipe.

- MXFP8 alignment: pad post-alltoall MoE expert input to a multiple of
  32 before GroupedLinear (attribute padding to the last expert so
  m_splits sums correctly; slice padding off the output). No-op for
  non-MXFP8 and already-aligned batches. Verified on 8x B300 SXM6 with
  Mixtral-8x7B EP=8 at SEQ=8192: FP8 1.196 s/step, MXFP8 1.248 s/step.

- FSDP2 checkpointing uses DCP format (.distcp files), covered by
  dedicated distributed checkpointing tests.

- CI-robust tests: session-scoped local WordLevel tokenizer fixture
  avoids HuggingFace Hub dependency; expanded train coverage (7
  single-GPU, 4 two-GPU tests per recipe) plus dataset and distributed
  checkpoint tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Timur Rvachov <trvachov@nvidia.com>
@trvachov trvachov force-pushed the trvachov/mixtral-recipes-repush branch from 160a754 to 63f58d3 on April 20, 2026 19:20
AsherBond pushed a commit to Distillative-AI/bionemo-framework that referenced this pull request Apr 21, 2026
…VIDIA#1556)

## Problem

The gitleaks pre-commit hook is silently passing in CI, even when
secrets are present. See
NVIDIA#1551, which
includes a hardcoded `WANDB_API_KEY` that gitleaks did not flag.

**Root cause:** The default gitleaks hook entry is:
```
gitleaks git --pre-commit --redact --staged --verbose
```

This scans **staged git changes** — it works during an actual `git
commit`. But in CI, `static_checks.sh` runs:
```
pre-commit run --all-files
```

With `--all-files`, there are no staged files and no commit context, so
gitleaks scans **0 commits** and reports "no leaks found":
```
7:02PM INF 0 commits scanned.
7:02PM INF scanned ~0 bytes (0) in 28.9ms
7:02PM INF no leaks found
```

## Fix

Override the hook entry to use `gitleaks dir --redact --verbose`, which
scans **file contents** directly. This works correctly both:
- Locally during `git commit` (pre-commit hook)
- In CI with `pre-commit run --all-files`
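The override described above would look roughly like this in `.pre-commit-config.yaml` (the `rev` tag is illustrative; pin to the current gitleaks release):

```yaml
- repo: https://github.com/gitleaks/gitleaks
  rev: v8.18.4  # illustrative; check the gitleaks repo for the current tag
  hooks:
    - id: gitleaks
      entry: gitleaks dir --redact --verbose
```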

## Testing

After this change, running `pre-commit run gitleaks --all-files` on the
repo will scan actual file contents instead of scanning 0 commits.

---------

Signed-off-by: svc-bionemo <267129667+svc-bionemo@users.noreply.github.com>
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Co-authored-by: svc-bionemo <267129667+svc-bionemo@users.noreply.github.com>
Co-authored-by: Peter St. John <pstjohn@nvidia.com>