Conversation
4layers proxy HSA_NO_SCRATCH_RECLAIM=1 NVTE_CK_USES_BWD_V3=1 GPU_MAX_HW_QUEUES=4
Pull request overview
Adds/updates Primus Megatron training launch scripts and model configs for MoE workloads (Qwen3 A3B/A22B, DeepSeek V2/V3), plus supporting tooling and infra changes (data prep, Slurm/local launch behavior, logging, CI).
Changes:
- Add multiple “start_training_*.sh” entrypoint scripts for Slurm-based pretraining runs.
- Update Megatron YAML configs to enable MoE dispatcher/grouped GEMM and cross-entropy TE fusion in several Qwen3/DeepSeek configs.
- Adjust runtime tooling: Slurm launcher env handling, local launcher container behavior, training log memory reporting, C4 prep script, and CI workflow.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 12 comments.
Summary per file:
| File | Description |
|---|---|
| start_training_qwen_30B_a3B.sh | New Slurm training wrapper for Qwen3-30B-A3B. |
| start_training_qwen_235B_a22B.sh | New Slurm training wrapper for Qwen3-235B-A22B, including PP/VPP layout logic. |
| start_training_dsv3_4layers_proxy.sh | New small-scale DeepSeek-V3 script with platform-based docker selection. |
| start_training_dsv3.sh | New/updated DeepSeek-V3 training wrapper with PP/VPP layout handling. |
| start_training_dsv2_lite.sh | New Slurm training wrapper for DeepSeek-V2-Lite. |
| primus/configs/models/megatron/qwen2.5_base.yaml | Adds MoE dispatcher/grouped GEMM defaults to Qwen base config. |
| primus/backends/megatron/patches/training_log/print_rank_last_patches.py | Enhances log output with max ROCm memory usage across ranks. |
| prepare_c4_data.sh | Adds a script to merge/tokenize C4-en shards for Megatron pretraining. |
| examples/run_slurm_pretrain.sh | Switches srun options to use SLURM_* env vars for time/partition/nodelist. |
| examples/run_pretrain.sh | Adds/changes AINIC networking/NCCL tuning exports. |
| examples/run_local_pretrain.sh | Adds a fixed data mount, image pull-if-missing behavior, and container ulimit. |
| examples/megatron/prepare.py | Adds --no-build-isolation to Emerging-Optimizers editable install. |
| examples/megatron/configs/MI355X/qwen3_30B_A3B-FP8-pretrain.yaml | Enables MoE dispatcher + enables TE cross-entropy fusion flags. |
| examples/megatron/configs/MI355X/qwen3_30B_A3B-BF16-pretrain.yaml | Enables MoE dispatcher/grouped GEMM + enables TE cross-entropy fusion flags. |
| examples/megatron/configs/MI355X/qwen3_235B_A22B-FP8-pretrain.yaml | Enables TE cross-entropy fusion flags. |
| examples/megatron/configs/MI355X/qwen3_235B_A22B-BF16-pretrain.yaml | Enables TE cross-entropy fusion flags. |
| examples/megatron/configs/MI355X/deepseek_v3-FP8-pretrain.yaml | Adds Turbo/deepep-related settings + enables TE cross-entropy fusion flags. |
| examples/megatron/configs/MI355X/deepseek_v3-BF16-pretrain.yaml | Changes default PP size + adds Turbo/deepep settings + enables TE cross-entropy fusion flags. |
| examples/megatron/configs/MI355X/deepseek_v2_lite-FP8-pretrain.yaml | Enables TE cross-entropy fusion flags. |
| examples/megatron/configs/MI355X/deepseek_v2_lite-BF16-pretrain.yaml | Enables TE cross-entropy fusion flags. |
| .github/workflows/ci.yaml | Comments out the JAX unit test job. |
| if [ "$PLATFORM" = "MI355X" ]; then | ||
| export DOCKER_IMAGE="docker.io/tasimage/primus:pr-563-ainic" | ||
| elif [ "$PLATFORM" = "B200" ] || [ "$PLATFORM" = "GB200" ]; then | ||
| export DOCKER_IMAGE="nvcr.io/nvidia/nemo:25.09" | ||
| EXTRA_ARGS="--use_rocm_mem_info_iters None" |
There was a problem hiding this comment.
EXTRA_ARGS is built as a single string containing spaces and later passed as "$EXTRA_ARGS", which becomes a single CLI argument (and even an empty argument when unset). This will break flag parsing for run_slurm_pretrain.sh. Use a bash array for optional args (e.g., EXTRA_ARGS=() and append --use_rocm_mem_info_iters and None as separate elements) and expand it with "${EXTRA_ARGS[@]}" (or omit entirely when empty).
| if [ "$PLATFORM" = "MI355X" ]; then | |
| export DOCKER_IMAGE="docker.io/tasimage/primus:pr-563-ainic" | |
| elif [ "$PLATFORM" = "B200" ] || [ "$PLATFORM" = "GB200" ]; then | |
| export DOCKER_IMAGE="nvcr.io/nvidia/nemo:25.09" | |
| EXTRA_ARGS="--use_rocm_mem_info_iters None" | |
| EXTRA_ARGS=() | |
| if [ "$PLATFORM" = "MI355X" ]; then | |
| export DOCKER_IMAGE="docker.io/tasimage/primus:pr-563-ainic" | |
| elif [ "$PLATFORM" = "B200" ] || [ "$PLATFORM" = "GB200" ]; then | |
| export DOCKER_IMAGE="nvcr.io/nvidia/nemo:25.09" | |
| EXTRA_ARGS=(--use_rocm_mem_info_iters None) |
```bash
export NCCL_IB_GID_INDEX=1
# export NCCL_IB_ROCE_VERSION_NUM=2
export NCCL_MAX_P2P_CHANNELS=56
# export NCCL_IB_TC=104
# export NCCL_IB_FIFO_TC=192
export NCCL_IB_TC=41
export NCCL_IB_FIFO_TC=185
export NET_OPTIONAL_RECV_COMPLETION=1
export NCCL_IB_USE_INLINE=1
export RCCL_GDR_FLUSH_GPU_MEM_NO_RELAXED_ORDERING=0
export NCCL_GDR_FLUSH_DISABLE=1
export NCCL_DMABUF_ENABLE=0
export NCCL_IGNORE_CPU_AFFINITY=1
export NCCL_IB_QPS_PER_CONNECTION=1

export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/libibverbs:${RCCL_HOME_DIR}/build/release:${ANP_HOME_DIR}/build:${MPI_HOME_DIR}/lib:$LD_LIBRARY_PATH
else
    export NCCL_IB_GID_INDEX=3
```
This block re-exports several NCCL/RCCL tuning env vars (including NCCL_IB_GID_INDEX, NCCL_MAX_P2P_CHANNELS, NCCL_IB_TC, NCCL_IB_FIFO_TC) after they were already computed/logged earlier in the same USING_AINIC branch. In particular, forcing NCCL_IB_TC=41/NCCL_IB_FIFO_TC=185 overrides the earlier detection/default logic and makes the earlier logs misleading. Consider removing this duplicate section, or only setting values when not already set so the detection path remains effective.
Suggested change:
```diff
-export NCCL_IB_GID_INDEX=1
-# export NCCL_IB_ROCE_VERSION_NUM=2
-export NCCL_MAX_P2P_CHANNELS=56
-# export NCCL_IB_TC=104
-# export NCCL_IB_FIFO_TC=192
-export NCCL_IB_TC=41
-export NCCL_IB_FIFO_TC=185
-export NET_OPTIONAL_RECV_COMPLETION=1
-export NCCL_IB_USE_INLINE=1
-export RCCL_GDR_FLUSH_GPU_MEM_NO_RELAXED_ORDERING=0
-export NCCL_GDR_FLUSH_DISABLE=1
-export NCCL_DMABUF_ENABLE=0
-export NCCL_IGNORE_CPU_AFFINITY=1
-export NCCL_IB_QPS_PER_CONNECTION=1
-export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/libibverbs:${RCCL_HOME_DIR}/build/release:${ANP_HOME_DIR}/build:${MPI_HOME_DIR}/lib:$LD_LIBRARY_PATH
-else
-export NCCL_IB_GID_INDEX=3
+[ -z "${NCCL_IB_GID_INDEX}" ] && export NCCL_IB_GID_INDEX=1
+# export NCCL_IB_ROCE_VERSION_NUM=2
+[ -z "${NCCL_MAX_P2P_CHANNELS}" ] && export NCCL_MAX_P2P_CHANNELS=56
+# export NCCL_IB_TC=104
+# export NCCL_IB_FIFO_TC=192
+[ -z "${NCCL_IB_TC}" ] && export NCCL_IB_TC=41
+[ -z "${NCCL_IB_FIFO_TC}" ] && export NCCL_IB_FIFO_TC=185
+[ -z "${NET_OPTIONAL_RECV_COMPLETION}" ] && export NET_OPTIONAL_RECV_COMPLETION=1
+[ -z "${NCCL_IB_USE_INLINE}" ] && export NCCL_IB_USE_INLINE=1
+[ -z "${RCCL_GDR_FLUSH_GPU_MEM_NO_RELAXED_ORDERING}" ] && export RCCL_GDR_FLUSH_GPU_MEM_NO_RELAXED_ORDERING=0
+[ -z "${NCCL_GDR_FLUSH_DISABLE}" ] && export NCCL_GDR_FLUSH_DISABLE=1
+[ -z "${NCCL_DMABUF_ENABLE}" ] && export NCCL_DMABUF_ENABLE=0
+[ -z "${NCCL_IGNORE_CPU_AFFINITY}" ] && export NCCL_IGNORE_CPU_AFFINITY=1
+[ -z "${NCCL_IB_QPS_PER_CONNECTION}" ] && export NCCL_IB_QPS_PER_CONNECTION=1
+export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/libibverbs:${RCCL_HOME_DIR}/build/release:${ANP_HOME_DIR}/build:${MPI_HOME_DIR}/lib:$LD_LIBRARY_PATH
+else
+[ -z "${NCCL_IB_GID_INDEX}" ] && export NCCL_IB_GID_INDEX=3
```
```bash
JSONL_FILE="${JSONL_DIR}/c4_en_train.jsonl"

if [ -f "${JSONL_FILE}" ]; then
    echo "JSONL file already exists: ${JSONL_FILE}"
    echo "Skipping merge. Delete it to re-merge."
else
    # Verify shards exist
    MISSING=0
    for i in $(seq 0 $((NUM_SHARDS - 1))); do
        SHARD_NAME=$(printf "c4-train.%05d-of-01024.json.gz" "$i")
        if [ ! -f "${RAW_DIR}/${SHARD_NAME}" ]; then
            echo "  WARNING: Missing shard ${SHARD_NAME}"
            MISSING=$((MISSING + 1))
        fi
    done
    if [ "$MISSING" -gt 0 ]; then
        echo "ERROR: ${MISSING} shard(s) missing in ${RAW_DIR}. Cannot proceed."
        exit 1
    fi

    echo "Decompressing and merging shards into JSONL ..."
    for i in $(seq 0 $((NUM_SHARDS - 1))); do
        SHARD_NAME=$(printf "c4-train.%05d-of-01024.json.gz" "$i")
        SHARD_PATH="${RAW_DIR}/${SHARD_NAME}"
        echo "  [${i}/${NUM_SHARDS}] Decompressing ${SHARD_NAME} ..."
        zcat "${SHARD_PATH}" >> "${JSONL_FILE}"
    done
```
The JSONL merge step appends shards directly into the final c4_en_train.jsonl. If the script is interrupted mid-merge, the partially written JSONL file will exist and subsequent runs will skip merging, leaving a corrupt/incomplete dataset. Write to a temporary file and mv atomically on success (or truncate the target before the loop and add a checksum/size check) to make reruns safe.
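A minimal sketch of the temp-file approach the comment suggests, reusing the variables from the hunk above (the `TMP_JSONL` name is illustrative, not from the PR):

```bash
# Merge into a temp file; publish atomically only after every shard succeeds.
TMP_JSONL="${JSONL_FILE}.tmp"   # hypothetical temp path on the same filesystem
: > "${TMP_JSONL}"              # truncate any leftover partial file from a prior run
for i in $(seq 0 $((NUM_SHARDS - 1))); do
    SHARD_NAME=$(printf "c4-train.%05d-of-01024.json.gz" "$i")
    zcat "${RAW_DIR}/${SHARD_NAME}" >> "${TMP_JSONL}"
done
mv "${TMP_JSONL}" "${JSONL_FILE}"   # rename is atomic within one filesystem
```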
```yaml
pipeline_model_parallel_size: ${PRIMUS_PP:8}
expert_model_parallel_size: ${PRIMUS_EP:8}
```
Changing the default pipeline_model_parallel_size from ${PRIMUS_PP:1} to ${PRIMUS_PP:8} makes the config default to PP=8 when PRIMUS_PP isn’t set, which can break or behave unexpectedly on single-node or small runs. Consider keeping the default at 1 (and letting training scripts set PRIMUS_PP explicitly), or document clearly that this config now requires PRIMUS_PP to be set appropriately.
Suggested change:
```diff
-pipeline_model_parallel_size: ${PRIMUS_PP:8}
-expert_model_parallel_size: ${PRIMUS_EP:8}
+pipeline_model_parallel_size: ${PRIMUS_PP:1}
+expert_model_parallel_size: ${PRIMUS_EP:1}
```
```bash
${SLURM_TIME:+--time="${SLURM_TIME}"} \
${SLURM_NODELIST:+--nodelist="${SLURM_NODELIST}"} \
${SLURM_PARTITION:+--partition="${SLURM_PARTITION}"} \
```
Switching to SLURM_NODELIST/SLURM_TIME/SLURM_PARTITION as user-provided overrides is confusing because SLURM_NODELIST is also a Slurm-populated env var inside allocations. That makes it hard to distinguish “allocated nodes” vs “requested override nodes”, and can accidentally force --nodelist when running inside Slurm. Consider keeping the previous NODE_LIST override (or renaming to something like PRIMUS_NODELIST_OVERRIDE) and updating the script to prefer that, falling back to Slurm’s own env vars when present.
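A sketch of that separation; `PRIMUS_NODELIST_OVERRIDE` is an illustrative name, not a variable from this PR:

```bash
# Prefer an explicit user override; otherwise rely on the Slurm allocation itself.
NODELIST_ARGS=()
if [ -n "${PRIMUS_NODELIST_OVERRIDE:-}" ]; then      # hypothetical override variable
    NODELIST_ARGS=(--nodelist="${PRIMUS_NODELIST_OVERRIDE}")
elif [ -z "${SLURM_JOB_ID:-}" ] && [ -n "${SLURM_NODELIST:-}" ]; then
    # Outside an allocation, SLURM_NODELIST can only be a user-provided request.
    NODELIST_ARGS=(--nodelist="${SLURM_NODELIST}")
fi
srun -N "${NNODES}" "${NODELIST_ARGS[@]}" --ntasks-per-node=1 hostname
```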
| f" | rank-{max_rank} rocm max mem usage/usage_ratio: " | ||
| f"{max_r_used / 1024 ** 3:.2f}GB/" | ||
| f"{max_r_used / r_total * 100:.2f}%" |
There was a problem hiding this comment.
The reported rank-{max_rank} rocm max mem ... usage_ratio uses max_r_used / r_total, where r_total is from the local rank, not necessarily max_rank. If total memory differs across GPUs/nodes, this ratio can be incorrect. Consider gathering r_total as well (or computing the ratio on the max-rank and broadcasting it) so the max-rank ratio is self-consistent.
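One way to make the ratio self-consistent is to gather used and total together per rank (a sketch; variable names follow the hunk above):

```python
import torch

# Gather (used, total) as one tensor so both values come from the same rank.
pair = torch.tensor([r_used, r_total], device="cuda", dtype=torch.int64)
world_size = torch.distributed.get_world_size()
gathered = [torch.zeros_like(pair) for _ in range(world_size)]
torch.distributed.all_gather(gathered, pair)

used_totals = [(int(t[0]), int(t[1])) for t in gathered]
max_rank, (max_used, max_total) = max(enumerate(used_totals), key=lambda x: x[1][0])
ratio = max_used / max_total * 100  # numerator and denominator from the same rank
```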
| ARGS=("$@") | ||
|
|
||
| VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH") | ||
| VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH" -v "/shared_aig/c4:/shared_aig/c4") |
There was a problem hiding this comment.
Adding a hard-coded bind mount for /shared_aig/c4 makes run_local_pretrain.sh environment-specific and can fail (or create unexpected host directories) on machines that don’t have that path. Consider making this conditional (only mount if the directory exists) or driven by an env var (e.g., EXTRA_VOLUME_ARGS / PRIMUS_DATA_MOUNTS) so the script remains portable.
Suggested change:
```diff
-VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH" -v "/shared_aig/c4:/shared_aig/c4")
+VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH")
+if [ -d "/shared_aig/c4" ]; then
+    VOLUME_ARGS+=(-v "/shared_aig/c4:/shared_aig/c4")
+fi
+if [[ -n "${EXTRA_VOLUME_ARGS:-}" ]]; then
+    # Allow callers to inject additional volume args, e.g. EXTRA_VOLUME_ARGS='-v /host/path:/container/path'
+    VOLUME_ARGS+=(${EXTRA_VOLUME_ARGS})
+fi
```
```yaml
# run-unittest-jax:
#   env:
#     PRIMUS_WORKDIR: /wekafs/primus-data/primus_safe_ci/jax
#   needs: [code-lint]
#   runs-on: [primus-lm-cicd-jax-8t8mh]
#   steps:
#     - run: echo "🎉 Begin Primus-Turbo Checkout."
#     - name: Set commit hash to env
#       run: echo "PRIMUS_TURBO_COMMIT=${PRIMUS_TURBO_COMMIT}" >> $GITHUB_ENV
#     - name: Checkout Repo Primus-Turbo
#       uses: actions/checkout@v4
#       with:
#         repository: AMD-AIG-AIMA/Primus-Turbo
#         submodules: "recursive"
#         path: Primus-Turbo
#         ref: ${{ env.PRIMUS_TURBO_COMMIT }}
#     - run: echo "Begin Primus-Turbo Install."
#     - name: Install Primus-Turbo
#       run: |
#         mv Primus-Turbo /tmp/
#         echo "Primus-Turbo dir: /tmp/Primus-Turbo"
#         git config --global --add safe.directory /tmp/Primus-Turbo
#         cd /tmp/Primus-Turbo
#         start_time=$(date +%s)
#         echo "✅ [Pip install requirements] started at: $(date)"
```
This PR comments out the entire run-unittest-jax job, and there doesn’t appear to be any other workflow job running the --jax unit tests after this change. If JAX support is still maintained, this will remove CI coverage and allow regressions to land. Consider fixing the runner selection / installation issue and keeping the job enabled, or replacing it with an equivalent JAX test job.
Suggested change:
```diff
-# run-unittest-jax:
-#   env:
-#     PRIMUS_WORKDIR: /wekafs/primus-data/primus_safe_ci/jax
-#   needs: [code-lint]
-#   runs-on: [primus-lm-cicd-jax-8t8mh]
-#   steps:
-#     - run: echo "🎉 Begin Primus-Turbo Checkout."
-#     - name: Set commit hash to env
-#       run: echo "PRIMUS_TURBO_COMMIT=${PRIMUS_TURBO_COMMIT}" >> $GITHUB_ENV
-#     - name: Checkout Repo Primus-Turbo
-#       uses: actions/checkout@v4
-#       with:
-#         repository: AMD-AIG-AIMA/Primus-Turbo
-#         submodules: "recursive"
-#         path: Primus-Turbo
-#         ref: ${{ env.PRIMUS_TURBO_COMMIT }}
-#     - run: echo "Begin Primus-Turbo Install."
-#     - name: Install Primus-Turbo
-#       run: |
-#         mv Primus-Turbo /tmp/
-#         echo "Primus-Turbo dir: /tmp/Primus-Turbo"
-#         git config --global --add safe.directory /tmp/Primus-Turbo
-#         cd /tmp/Primus-Turbo
-#         start_time=$(date +%s)
-#         echo "✅ [Pip install requirements] started at: $(date)"
+run-unittest-jax:
+  env:
+    PRIMUS_WORKDIR: /wekafs/primus-data/primus_safe_ci/jax
+  needs: [code-lint]
+  runs-on: [primus-lm-cicd-jax-8t8mh]
+  steps:
+    - name: Begin JAX unit tests
+      run: echo "🎉 Begin JAX unit tests in ${PRIMUS_WORKDIR}"
+    - name: Set commit hash to env
+      run: echo "PRIMUS_TURBO_COMMIT=${PRIMUS_TURBO_COMMIT}" >> $GITHUB_ENV
+    - name: Checkout repository
+      uses: actions/checkout@v4
+    - name: Set up Python
+      uses: actions/setup-python@v5
+      with:
+        python-version: "3.10"
+    - name: Install dependencies (including JAX extras)
+      run: |
+        python -m pip install --upgrade pip
+        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
+        # Attempt to install JAX-related extras if defined; ignore failures
+        pip install -e ".[jax,test]" || pip install -e ".[jax]" || true
+    - name: Run JAX unit tests
+      run: |
+        pytest --jax
```
```bash
--profile_step_end 7 \
--profile_step_start 6 \
```
TRAIN_ITERS is set to 5, but --profile_step_start 6 / --profile_step_end 7 are outside the iteration range. If profiling is enabled (PROFILE=True), this likely results in no profiling data being captured. Consider tying the profile step range to TRAIN_ITERS (or lowering these defaults) so the script works when toggling profiling on.
Suggested change:
```diff
-    --profile_step_end 7 \
-    --profile_step_start 6 \
+    --profile_step_end $TRAIN_ITERS \
+    --profile_step_start 1 \
```
```bash
--profile $PROFILE \
--use_pytorch_profiler $PROFILE \
--profile_step_end 7 \
--profile_step_start 6 \
```
TRAIN_ITERS is set to 5, but --profile_step_start 6 / --profile_step_end 7 are outside the iteration range. If profiling is enabled (PROFILE=True), this likely results in no profiling data being captured. Consider tying the profile step range to TRAIN_ITERS (or lowering these defaults) so the script works when toggling profiling on.
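A sketch of deriving the window from TRAIN_ITERS instead of hardcoding 6/7 (the clamping below is illustrative):

```bash
# Profile the final two iterations regardless of how short the run is.
PROFILE_STEP_END=$TRAIN_ITERS
PROFILE_STEP_START=$(( TRAIN_ITERS > 1 ? TRAIN_ITERS - 1 : 1 ))
# ...then pass --profile_step_start "$PROFILE_STEP_START" --profile_step_end "$PROFILE_STEP_END"
```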
```bash
# ======================== Configuration ========================
NUM_SHARDS=${NUM_SHARDS:-200}  # Number of C4 shards to download (1-1024)
DATA_DIR=${DATA_DIR:-"/shared/c4"}
PRIMUS_PATH=${PRIMUS_PATH:-"/shared/john/Primus"}
```
| export HF_TOKEN="your_hf_token" # make it your own hf token | ||
| export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key |
| log_info(f"Building Emerging Optimizers in {emerging_optimizers_path}") | ||
| ret = subprocess.run(["pip", "install", "-e", str(emerging_optimizers_path)], check=True) | ||
| ret = subprocess.run( | ||
| ["pip", "install", "--no-build-isolation", "-e", str(emerging_optimizers_path)], check=True |
```yaml
moe_token_dispatcher_type: alltoall

moe_grouped_gemm: true
```
Pull request overview
Copilot reviewed 24 out of 24 changed files in this pull request and generated 13 comments.
Comments suppressed due to low confidence (1)
primus/configs/models/megatron/qwen2.5_base.yaml:34
- This YAML now defines `moe_token_dispatcher_type` and `moe_grouped_gemm` twice (once around lines 26–28 and again around lines 32–35). Duplicate keys in YAML are ambiguous (parsers typically take the last value) and can hide configuration mistakes. Please remove the redundant earlier entries or consolidate them into a single block.
```yaml
moe_token_dispatcher_type: alltoall
moe_grouped_gemm: true
apply_rope_fusion: true
masked_softmax_fusion: false
moe_token_dispatcher_type: alltoall
moe_grouped_gemm: true
moe_use_legacy_grouped_gemm: false
```
```bash
--disable_wandb True \
--disable_tensorboard True \
"$EXTRA_ARGS" \
2>&1 | tee output/$PRIMUS_TEAM/$PRIMUS_USER/$PRIMUS_EXP_NAME/log.txt
```
| export HF_TOKEN="your_hf_token" # make it your own hf token | ||
| export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key |
| srun -N "${NNODES}" \ | ||
| --exclusive \ | ||
| --export ALL \ | ||
| ${NODE_LIST:+--nodelist="${NODE_LIST}"} \ | ||
| --ntasks-per-node=1 \ | ||
| ${SLURM_TIME:+--time="${SLURM_TIME}"} \ | ||
| ${SLURM_NODELIST:+--nodelist="${SLURM_NODELIST}"} \ | ||
| ${SLURM_PARTITION:+--partition="${SLURM_PARTITION}"} \ |
| export HF_TOKEN="your_hf_token" # make it your own hf token | ||
| export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key |
| export HF_TOKEN="your_hf_token" # make it your own hf token | ||
| export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key | ||
|
|
| # run-unittest-jax: | ||
| # env: | ||
| # PRIMUS_WORKDIR: /wekafs/primus-data/primus_safe_ci/jax | ||
| # needs: [code-lint] | ||
| # runs-on: [primus-lm-cicd-jax-8t8mh] | ||
| # steps: |
| export HF_TOKEN="your_hf_token" # make it your own hf token | ||
| export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key |
| export SLURM_NODELIST="uswslocpm2m-106-[030-031,038-039,050,063,069,225,942,1531-1532,1536,1547,1549,1554,1556-1557,1561,1579,1583,1585,1588,1592,1596,1606,1627-1629,1650,1659-1660,1678]" | ||
| export SLURM_NODELIST="uswslocpm2m-106-[030-031,038-039,050,063,069,225,942,1531-1532,1536,1547,1549,1554,1556]" # -1557,1561,1579,1583,1585,1588,1592,1596,1606,1627-1629,1650,1659-1660,1678]" | ||
| export SLURM_NODELIST="uswslocpm2m-106-[030-031,038-039,050,063,069,225,942,1531-1532,1536,-1557,1561,1579,1583]" #1585,1588,1592,1596,1606,1627-1629,1650,1659-1660,1678]" |
| export NCCL_IB_TC=41 | ||
| export NCCL_IB_FIFO_TC=185 |
| export HF_TOKEN="your_hf_token" # make it your own hf token | ||
| export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key | ||
|
|
Remove the --debug gate so users always see the complete stack trace alongside the one-line summary, reducing back-and-forth when diagnosing failures. Made-with: Cursor
Pull request overview
Copilot reviewed 159 out of 159 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (13)
- `runner/primus-cli-container.sh:1`: This now hard-fails when neither docker nor podman is present, even in dry-run mode. The previous logic explicitly allowed a mock runtime during dry-runs; without that, `--dry-run` becomes unusable on hosts lacking container tooling. Consider restoring the dry-run fallback (mock runtime) or skipping runtime validation when `DRY_RUN_MODE == true`.
- `primus/modules/trainer/megatron/trainer.py:1`: The new defaulting logic does not handle the common case where `args.moe_layer_freq` exists but is `None`. In that case, the value remains `None` and downstream code expecting an int/list can break. Suggest setting the default when `getattr(args, "moe_layer_freq", None) is None` (not only when the attribute is missing).
- `runner/helpers/envs/base_env.sh:1`: The comments and defaults are now inconsistent: `HSA_NO_SCRATCH_RECLAIM` defaults to 1, but the comment claims setting it to 0 prevents core dumps; similarly the TE comment says to disable v3 due to accuracy issues, but the default is now 1 (enabled). Please update the comments to match the new defaults/behavior, or revert the defaults to align with the documented guidance.
- `runner/helpers/envs/base_env.sh:1`: The comments and defaults are now inconsistent: `HSA_NO_SCRATCH_RECLAIM` defaults to 1, but the comment claims setting it to 0 prevents core dumps; similarly the TE comment says to disable v3 due to accuracy issues, but the default is now 1 (enabled). Please update the comments to match the new defaults/behavior, or revert the defaults to align with the documented guidance.
- `start_training_dsv3.sh:1`: This script exports `SLURM_NODELIST` four times (only the last value takes effect), and one intermediate value contains `-1557` inside the bracket expression, which looks like an invalid node/range token. Please remove the redundant exports and keep a single valid `SLURM_NODELIST` (or make the selection conditional via an argument/env var).
- `start_training_dsv3_4layers_proxy.sh:1`: The first `PRIMUS_EXP_NAME` assignment is immediately overwritten, so it will never be used. This makes it hard to tell which experiment naming convention is intended. Consider making the debug name conditional (e.g., gated on an env var) or removing the unused assignment.
- `primus/backends/megatron/patches/training_log/print_rank_last_patches.py:1`: Using `all_gather` allocates O(world_size) tensors and moves a list of values to every rank on each sample, which can add noticeable overhead at large scale. If you only need the maximum, consider an `all_reduce(MAX)` on `r_used_tensor`. If you also want the rank of the max, consider reducing a (value, rank) pair via a custom reduction or gathering only on rank 0.
- `primus/backends/megatron/patches/pp_warmup_patches.py:1`: This warmup implementation assumes a very specific model structure (`model` iterable of chunks, then `model_chunk.module.module.decoder.layers`) and a specific per-layer forward signature/return shape (`dummy_output, _ = layer.forward(...)`). This is likely to break for non-GPT models (e.g., Mamba), encoder-decoder variants, or Megatron version differences, causing pre-train crashes when `pp_warmup` is enabled. Recommend adding robust feature-detection (validate attribute chain + callable signature) and falling back to a safer warmup (e.g., a single dummy forward/backward through the normal `forward_step_func`), or gating the patch to known-compatible backend/model_types.
- `primus/backends/megatron/patches/pp_warmup_patches.py:1`: This warmup implementation assumes a very specific model structure (`model` iterable of chunks, then `model_chunk.module.module.decoder.layers`) and a specific per-layer forward signature/return shape (`dummy_output, _ = layer.forward(...)`). This is likely to break for non-GPT models (e.g., Mamba), encoder-decoder variants, or Megatron version differences, causing pre-train crashes when `pp_warmup` is enabled. Recommend adding robust feature-detection (validate attribute chain + callable signature) and falling back to a safer warmup (e.g., a single dummy forward/backward through the normal `forward_step_func`), or gating the patch to known-compatible backend/model_types.
- `runner/helpers/hooks/03_enable_ainic.sh:1`: These settings are now hard-coded and will override any values a user may have set in the environment. Previously the script used `${VAR:-default}` patterns, which allowed overrides for cluster-specific tuning. Consider restoring the override-friendly defaults (e.g., `NCCL_IB_TC=${NCCL_IB_TC:-41}` etc.) so operators can tune without editing the hook.
- `primus/backends/maxtext/patches/max_utils_patches.py:1`: This monkey-patches `jax.distributed.initialize` globally and does not restore it. In long-lived processes (or if multiple backends/modules are initialized in one process), this can leak behavior across unrelated code and make debugging version-specific JAX issues harder. Consider restoring the original function after MaxText initialization (e.g., patch only within the initialization call scope) or attaching the wrapper in a way that can be cleanly undone during teardown.
- `primus/cli/main.py:1`: The exception handler now always prints the full traceback via `traceback.print_exc()`, then also prints a summarized error line. This changes the CLI behavior from the previous `--debug`-gated verbosity and can significantly increase log volume (and potentially expose internal paths) in production runs. Consider restoring the `--debug` gating (or an env flag) for full tracebacks, while keeping the concise error+location output for non-debug runs.
- `primus/cli/main.py:1`: The exception handler now always prints the full traceback via `traceback.print_exc()`, then also prints a summarized error line. This changes the CLI behavior from the previous `--debug`-gated verbosity and can significantly increase log volume (and potentially expose internal paths) in production runs. Consider restoring the `--debug` gating (or an env flag) for full tracebacks, while keeping the concise error+location output for non-debug runs.
add opt config
Co-authored-by: HuangWei-95 <weihuan@amd.com>
Added configuration modifications to qwen3_30B, dated March 24th.
```python
import megatron.core.pipeline_parallel as pp_module
import megatron.training.training as training_module

import primus.modules.trainer.megatron.utils as utils
```
```bash
export SLURM_NODELIST="uswslocpm2m-106-[030-031,038-039,050,063,069,225,942,1531-1532,1536,1547,1549,1554,1556-1557,1561,1579,1583,1585,1588,1592,1596,1606,1627-1629,1650,1659-1660,1678]"
export SLURM_NODELIST="uswslocpm2m-106-[030-031,038-039,050,063,069,225,942,1531-1532,1536,1547,1549,1554,1556]" # -1557,1561,1579,1583,1585,1588,1592,1596,1606,1627-1629,1650,1659-1660,1678]"
export SLURM_NODELIST="uswslocpm2m-106-[030-031,038-039,050,063,069,225,942,1531-1532,1536,-1557,1561,1579,1583]" #1585,1588,1592,1596,1606,1627-1629,1650,1659-1660,1678]"
export SLURM_NODELIST="uswslocpm2m-106-[030-031,038-039,050,063,069,225,942,1531-1532,1536,1547,1549,1554,1556-1557,1561,1579,1583,1585,1588,1592,1596,1606,1627-1629,1683-1684,1691,1697]"
```
This script exports SLURM_NODELIST four times in a row; the first three assignments are dead code and one contains a malformed token (...,1536,-1557,1561,...). Keeping only a single SLURM_NODELIST (or making it an input override) will avoid confusion and accidental invalid nodelists.
```bash
8)
    FEATURE_ARGS+=("--pipeline_model_parallel_layout" "'Et*7|t*8|t*8|t*8|t*8|t*8|t*7|t*7,L'")
    ;;
16)
    FEATURE_ARGS+=("--pipeline_model_parallel_layout" "'Et*3|t*4|t*4|t*4|t*4|t*4|t*4|t*4|t*4|t*4|t*4|t*4|t*4|t*4|t*3|t*3,L'")
    ;;
32)
    FEATURE_ARGS+=("--pipeline_model_parallel_layout" "'Et*1|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*1|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*1,L'")
    ;;
```
FEATURE_ARGS values include embedded single quotes (e.g., "'Et*7|...|...,L'"). Because the value is already quoted by Bash, those inner quotes will be passed literally to the CLI and may break pipeline_model_parallel_layout parsing. Pass the layout string without the extra '...' wrapper.
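For reference, the 8-way case without the inner quotes (a sketch of the suggested fix):

```bash
# Bash's double quotes are sufficient; the CLI receives the bare layout string.
FEATURE_ARGS+=("--pipeline_model_parallel_layout" "Et*7|t*8|t*8|t*8|t*8|t*8|t*7|t*7,L")
```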
| srun -N "${NNODES}" \ | ||
| --exclusive \ | ||
| --export ALL \ | ||
| ${NODE_LIST:+--nodelist="${NODE_LIST}"} \ | ||
| --ntasks-per-node=1 \ | ||
| ${SLURM_TIME:+--time="${SLURM_TIME}"} \ | ||
| ${SLURM_NODELIST:+--nodelist="${SLURM_NODELIST}"} \ | ||
| ${SLURM_PARTITION:+--partition="${SLURM_PARTITION}"} \ |
There was a problem hiding this comment.
This srun invocation now uses SLURM_TIME, SLURM_NODELIST, and SLURM_PARTITION, but the script’s --help section (not shown in this hunk) still documents NODE_LIST. Please update the help text to match the new env var names/options to avoid user confusion.
```python
args = get_megatron_args()
if getattr(args, "pp_warmup", True):
    run_pp_warmup(model, config, args, optimizer, get_timers())
return original_train(
```
getattr(args, "pp_warmup", True) defaults to True, so PP warmup will run even when the flag is absent. This is risky because warmup does extra forward/backward work and may break non-GPT models. Default this to False (or require the arg to be explicitly set), and add a one-time guard (e.g., closure flag) so warmup truly runs only once as stated in the docstring.
```diff
 # Validate container runtime (docker/podman)
-# In dry-run mode, allow mock runtime if docker/podman not available
-if [[ "$DRY_RUN_MODE" == "true" ]]; then
-    if command -v podman >/dev/null 2>&1; then
-        export CONTAINER_RUNTIME="podman"
-    elif command -v docker >/dev/null 2>&1; then
-        export CONTAINER_RUNTIME="docker"
-    else
-        # Mock runtime for dry-run testing
-        export CONTAINER_RUNTIME="docker"
-        LOG_INFO_RANK0 "[container] Using mock container runtime for dry-run (no docker/podman found)"
-    fi
-    validate_container_runtime
+if command -v docker >/dev/null 2>&1; then
+    export CONTAINER_RUNTIME="docker"
+elif command -v podman >/dev/null 2>&1; then
+    export CONTAINER_RUNTIME="podman"
+else
+    LOG_ERROR "[container] Neither docker nor podman is available in PATH."
+    exit 1
+fi
```
This container-runtime validation now hard-fails when neither docker nor podman is on PATH, even in --dry-run mode. That defeats dry-run usage on login/CI hosts without container tooling. Consider keeping the previous behavior: in dry-run allow a mock/default runtime (or skip runtime validation entirely) while still validating in real runs.
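One sketch that keeps the hard failure for real runs but restores the dry-run fallback (assuming the DRY_RUN_MODE flag and logging helpers shown above):

```bash
if command -v docker >/dev/null 2>&1; then
    export CONTAINER_RUNTIME="docker"
elif command -v podman >/dev/null 2>&1; then
    export CONTAINER_RUNTIME="podman"
elif [[ "$DRY_RUN_MODE" == "true" ]]; then
    # Dry-run only: a mock runtime so command generation still works on bare hosts.
    export CONTAINER_RUNTIME="docker"
    LOG_INFO_RANK0 "[container] Using mock container runtime for dry-run (no docker/podman found)"
else
    LOG_ERROR "[container] Neither docker nor podman is available in PATH."
    exit 1
fi
```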
```diff
 # Note: Disable v3 due to accuracy issues. Will fix after TE version 2.1.
-export NVTE_CK_USES_BWD_V3=${NVTE_CK_USES_BWD_V3:-0}
+export NVTE_CK_USES_BWD_V3=${NVTE_CK_USES_BWD_V3:-1}
```
The comment says to disable v3 due to accuracy issues, but the default is now set to NVTE_CK_USES_BWD_V3:-1 (enabled). Either update the comment to match the new default rationale, or keep the default disabled (:-0) and enable only via explicit opt-in.
```python
# When pipeline parallelism (PP) is enabled, memory usage can vary across ranks.
# Therefore, we report the maximum ROCm memory usage across all ranks.
r_used_tensor = torch.tensor([r_used], device="cuda", dtype=torch.int64)
world_size = torch.distributed.get_world_size()
gathered_r_used = [torch.zeros_like(r_used_tensor) for _ in range(world_size)]
torch.distributed.all_gather(gathered_r_used, r_used_tensor)

total_r_used = [t.item() for t in gathered_r_used]
log_rank_0(f"total_r_used: {[round(r_used / 1024 ** 3, 2) for r_used in total_r_used]}")
max_r_used = max(total_r_used)
max_rank = total_r_used.index(max_r_used)
```
Collecting per-rank ROCm memory via all_gather and logging the full total_r_used list can add noticeable overhead and produce very large logs at scale. Consider using a cheaper collective (e.g., reduce/max) and only logging the aggregate (or gate the per-rank list behind a debug flag).
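A cheaper variant along the lines the comment suggests; the rank-packing trick below is illustrative and assumes world_size < 65536:

```python
import torch

# One MAX all_reduce replaces the O(world_size) all_gather; packing the rank
# into the low bits lets the same reduction also identify which rank held the max.
packed = torch.tensor(
    [r_used * 65536 + torch.distributed.get_rank()], device="cuda", dtype=torch.int64
)
torch.distributed.all_reduce(packed, op=torch.distributed.ReduceOp.MAX)
max_r_used, max_rank = divmod(int(packed.item()), 65536)
```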
```python
raise ValueError(
    f'When using recompute_layer_ids, recompute_granuarlity: {args.recompute_granularity} must be "full"'
)
```
Typo in the error message: recompute_granuarlity should be recompute_granularity (and matching spelling makes grep/debugging easier).
```bash
set -x

export HF_TOKEN="${HF_TOKEN:-'your_hf_token'}" # make it your own hf token
export WANDB_API_KEY="${WANDB_API_KEY:-'your_wandb_api_key'}" # make it your own wandb api key
```
set -x will echo all commands, including the HF/W&B token exports. This can leak credentials into logs; disable xtrace by default or only enable it after sensitive env vars are set (or guard it behind a DEBUG flag).
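A sketch of the guard (the DEBUG flag name is illustrative):

```bash
export HF_TOKEN="${HF_TOKEN:-}"            # set secrets before enabling xtrace
export WANDB_API_KEY="${WANDB_API_KEY:-}"

if [[ "${DEBUG:-0}" == "1" ]]; then        # hypothetical opt-in flag
    set -x                                 # command echo starts after secrets are set
fi
```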
| export HF_TOKEN="${HF_TOKEN:-'your_hf_token'}" # make it your own hf token | ||
| export WANDB_API_KEY="${WANDB_API_KEY:-'your_wandb_api_key'}" # make it your own wandb api key |
There was a problem hiding this comment.
The default values for HF_TOKEN/WANDB_API_KEY include single quotes inside the parameter expansion (e.g. :-'your_hf_token'), so the resulting env var value will literally contain quotes. That will break authentication unless users manually remove them; use an unquoted placeholder (or empty default) instead.
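The corresponding one-line fix, for reference:

```bash
export HF_TOKEN="${HF_TOKEN:-your_hf_token}"   # default contains no literal quotes
```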
| export HF_TOKEN="your_hf_token" # make it your own hf token | ||
| export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key | ||
| export DOCKER_IMAGE="docker.io/tasimage/primus:pr-563-ainic" |
There was a problem hiding this comment.
Hardcoding tokens in the script (export HF_TOKEN="your_hf_token", etc.) makes it easy to accidentally commit real credentials after editing. Prefer the ${VAR:-...} pattern (or require them to be provided externally) and avoid writing secrets into the repo.
```diff
 # Select and instantiate the tokenizer.
 if args.tokenizer_type in CUSTOM_TOKENIZER_TYPES:
-    tokenizer = _HuggingFaceTokenizer(args.tokenizer_model)
+    tokenizer = _HuggingFaceTokenizer(args.tokenizer_model, trust_remote_code=True)
 else:
```
Enabling trust_remote_code=True for all HuggingFace tokenizers will execute arbitrary code from remote model repos. This is a significant supply-chain risk; make this opt-in via a config/CLI flag (default False) or restrict it to the specific tokenizers that require it.
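A sketch of the opt-in gating; the `trust_remote_code` attribute on `args` is an assumed new flag, not an existing one:

```python
# Default-off: only runs that explicitly pass the flag execute remote tokenizer code.
trust_remote_code = getattr(args, "trust_remote_code", False)  # hypothetical flag
if args.tokenizer_type in CUSTOM_TOKENIZER_TYPES:
    tokenizer = _HuggingFaceTokenizer(args.tokenizer_model, trust_remote_code=trust_remote_code)
```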
```diff
 # Prevent scratch memory from being reclaimed to stabilize large memory usage
 # NOTE: Must disable scratch reclaim to avoid MoE training crash on AMD GPUs
 # Setting this to 0 prevents core dumps when using Mixture-of-Experts (MoE) models
-export HSA_NO_SCRATCH_RECLAIM=${HSA_NO_SCRATCH_RECLAIM:-0}
+export HSA_NO_SCRATCH_RECLAIM=${HSA_NO_SCRATCH_RECLAIM:-1}
```
The comment says “Setting this to 0 prevents core dumps…”, but the default is now HSA_NO_SCRATCH_RECLAIM:-1. Either the comment is now incorrect or the default should remain 0; please align the comment with the actual behavior to avoid misconfiguration.
| export HF_TOKEN="${HF_TOKEN:-'your_hf_token'}" # make it your own hf token | ||
| export WANDB_API_KEY="${WANDB_API_KEY:-'your_wandb_api_key'}" # make it your own wandb api key |
There was a problem hiding this comment.
The default values for HF_TOKEN/WANDB_API_KEY include single quotes inside the parameter expansion, so the env var value will literally contain quotes. This will break authentication unless edited; use an unquoted placeholder (or empty default) instead.
```diff
 # Note: Disable v3 due to accuracy issues. Will fix after TE version 2.1.
-export NVTE_CK_USES_BWD_V3=${NVTE_CK_USES_BWD_V3:-0}
+export NVTE_CK_USES_BWD_V3=${NVTE_CK_USES_BWD_V3:-1}
```
The comment states “Disable v3 due to accuracy issues”, but the default was changed to NVTE_CK_USES_BWD_V3:-1, which enables it. Please either revert the default or update the comment to reflect the intended default/risks.
Add training configurations for the Minimax 2.5 and GLAM5 models, including single-node and 8-node/16-node training scripts. Co-authored-by: botaohu001 <Botao.Hu@amd.com>
```python
def new_post_init(self):
    tmp = getattr(self, "recompute_granularity", None)
    self.recompute_granularity = None
    orig_post_init(self)
    self.recompute_granularity = tmp
    validate_specified_recompute_layers(config_mod.TransformerConfig, args)
```
In new_post_init(), validate_specified_recompute_layers() is called with config_mod.TransformerConfig (the class) instead of the config instance (self). This will validate/mutate class attributes rather than the actual run’s config and can raise attribute errors or silently ignore the user-provided recompute_layer_ids. Pass the instance (self) into validate_specified_recompute_layers so it validates the effective config for this run.
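The minimal fix as described, sketched against the hunk above:

```python
def new_post_init(self):
    tmp = getattr(self, "recompute_granularity", None)
    self.recompute_granularity = None
    orig_post_init(self)
    self.recompute_granularity = tmp
    # Validate the live config instance, not the TransformerConfig class object.
    validate_specified_recompute_layers(self, args)
```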
```python
if self.config.packing and self.config.dataset_type != "synthetic":
    if decoder_segment_ids is None:
        decoder_segment_ids = jnp.ones(shape=query.shape[:2], dtype=jnp.int32)
    attn_mask = SequenceDescriptor.from_segment_ids_and_pos(
        segment_ids=decoder_segment_ids, segment_pos=None
    )
    qkv_layout = "THD_THD_THD"  # 'T3HD', 'THD_T2HD' or 'THD_THD_THD'
```
PrimusAttentionOp.cudnn_flash_attention references SequenceDescriptor but it is never imported in this module, which will raise NameError when packing is enabled. Import SequenceDescriptor from the appropriate MaxText/TE module (matching the legacy MaxText API) or avoid referencing it here.
```diff
 # Validate container runtime (docker/podman)
-# In dry-run mode, allow mock runtime if docker/podman not available
-if [[ "$DRY_RUN_MODE" == "true" ]]; then
-    if command -v podman >/dev/null 2>&1; then
-        export CONTAINER_RUNTIME="podman"
-    elif command -v docker >/dev/null 2>&1; then
-        export CONTAINER_RUNTIME="docker"
-    else
-        # Mock runtime for dry-run testing
-        export CONTAINER_RUNTIME="docker"
-        LOG_INFO_RANK0 "[container] Using mock container runtime for dry-run (no docker/podman found)"
-    fi
+if command -v docker >/dev/null 2>&1; then
+    export CONTAINER_RUNTIME="docker"
+elif command -v podman >/dev/null 2>&1; then
+    export CONTAINER_RUNTIME="podman"
```
Container runtime detection now hard-fails if neither docker nor podman is in PATH. This also blocks --dry-run mode, which previously could be used for command-generation/testing on systems without a container runtime. Consider skipping runtime validation (or keeping the prior mock-runtime fallback) when DRY_RUN_MODE=true.
```diff
 except SystemExit:
     raise
 except Exception:
     # Torchrun/elastic can sometimes suppress worker tracebacks.
     # Print here so users can see the real root cause.
-    if getattr(args, "debug", False):
-        traceback.print_exc()
-    else:
-        print(f"[Primus] Error: {traceback.format_exc().splitlines()[-1]}", file=sys.stderr)
-        print("[Primus] Re-run with --debug for full stack trace.", file=sys.stderr)
+    traceback.print_exc()
```
The CLI now always prints a full traceback (traceback.print_exc()) on any exception, regardless of the --debug flag. This makes --debug ineffective and can produce extremely noisy logs (especially under torchrun/elastic). Restore the prior behavior: print full traceback only when --debug is set, otherwise emit a concise error with a hint to re-run with --debug.
| export HF_TOKEN="${HF_TOKEN:-'your_hf_token'}" # make it your own hf token | ||
| export WANDB_API_KEY="${WANDB_API_KEY:-'your_wandb_api_key'}" # make it your own wandb api key |
There was a problem hiding this comment.
HF_TOKEN/WANDB_API_KEY defaults include single quotes inside the parameter expansion (e.g., ${HF_TOKEN:-'your_hf_token'}), which will literally set the env var value to a quoted string when unset. Remove the inner single quotes so the default is a usable placeholder value.
```bash
export PRIMUS_EXP_NAME=minimax_m2.5-pretrain-platform_$PLATFORM-layers_$PRIMUS_TOTAL_LAYERS-type_$PRETRAIN_TYPE-mbs_$MBS-gbs_$GBS-legacygg_$LEGACY_GG
export PRIMUS_EXP_NAME=debug_minimax_m2.5_4layers-type_$PRETRAIN_TYPE-legacygg_$LEGACY_GG-turbogg_$TURBO_GROUPED_MLP
```
PRIMUS_EXP_NAME is assigned twice; the second assignment overwrites the first and forces a debug name. If this is meant to be optional, gate it behind a flag or comment it out; otherwise remove the redundant assignment so runs get the expected experiment naming/output directory.
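A sketch of gating the debug name behind a flag (`DEBUG_RUN` is an illustrative variable):

```bash
export PRIMUS_EXP_NAME=minimax_m2.5-pretrain-platform_$PLATFORM-layers_$PRIMUS_TOTAL_LAYERS-type_$PRETRAIN_TYPE-mbs_$MBS-gbs_$GBS-legacygg_$LEGACY_GG
if [ "${DEBUG_RUN:-0}" = "1" ]; then   # hypothetical opt-in
    export PRIMUS_EXP_NAME=debug_minimax_m2.5_4layers-type_$PRETRAIN_TYPE-legacygg_$LEGACY_GG-turbogg_$TURBO_GROUPED_MLP
fi
```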
```bash
export PRIMUS_EXP_NAME=dsv3-pretrain-platform_$PLATFORM-layers_$PRIMUS_TOTAL_LAYERS-type_$PRETRAIN_TYPE-mbs_$MBS-gbs_$GBS-legacygg_$LEGACY_GG
export PRIMUS_EXP_NAME=debug_4layers-type_$PRETRAIN_TYPE-legacygg_$LEGACY_GG-turbogg_$TURBO_GROUPED_MLP
```
PRIMUS_EXP_NAME is assigned twice; the second assignment overwrites the first and forces a debug name. If this is meant to be optional, gate it behind a flag or comment it out; otherwise remove the redundant assignment so runs get the expected experiment naming/output directory.
| elif [ "$PLATFORM" = "B200" ] || [ "$PLATFORM" = "GB200" ]; then | ||
| export DOCKER_IMAGE="nvcr.io/nvidia/nemo:25.09" | ||
| EXTRA_ARGS="--use_rocm_mem_info_iters None" | ||
| else |
There was a problem hiding this comment.
EXTRA_ARGS is assigned as a single string containing a space ("--use_rocm_mem_info_iters None") and later passed as "${EXTRA_ARGS}". When quoted, this becomes one CLI argument (with an embedded space) instead of two separate arguments, which will likely break argparse parsing. Use an array for EXTRA_ARGS (and expand it as "${EXTRA_ARGS[@]}"), or avoid quoting and ensure correct word-splitting.
| elif [ "$PLATFORM" = "B200" ] || [ "$PLATFORM" = "GB200" ]; then | ||
| export DOCKER_IMAGE="nvcr.io/nvidia/nemo:25.09" | ||
| EXTRA_ARGS="--use_rocm_mem_info_iters None" | ||
| else |
There was a problem hiding this comment.
EXTRA_ARGS is assigned as a single string containing a space ("--use_rocm_mem_info_iters None") and later passed as "${EXTRA_ARGS}". When quoted, this becomes one CLI argument (with an embedded space) instead of two separate arguments, which will likely break argparse parsing. Use an array for EXTRA_ARGS (and expand it as "${EXTRA_ARGS[@]}"), or avoid quoting and ensure correct word-splitting.