
Dev/tas/moe package v2.0 #599

Closed
Xiaoming-AMD wants to merge 47 commits into main from dev/tas/moe_package_v2.0

Conversation

@Xiaoming-AMD
Collaborator

No description provided.

Copilot AI review requested due to automatic review settings March 12, 2026 08:25
Contributor

Copilot AI left a comment


Pull request overview

Adds/updates Primus Megatron training launch scripts and model configs for MoE workloads (Qwen3 A3B/A22B, DeepSeek V2/V3), plus supporting tooling and infra changes (data prep, Slurm/local launch behavior, logging, CI).

Changes:

  • Add multiple “start_training_*.sh” entrypoint scripts for Slurm-based pretraining runs.
  • Update Megatron YAML configs to enable MoE dispatcher/grouped GEMM and cross-entropy TE fusion in several Qwen3/DeepSeek configs.
  • Adjust runtime tooling: Slurm launcher env handling, local launcher container behavior, training log memory reporting, C4 prep script, and CI workflow.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 12 comments.

File Description
start_training_qwen_30B_a3B.sh New Slurm training wrapper for Qwen3-30B-A3B.
start_training_qwen_235B_a22B.sh New Slurm training wrapper for Qwen3-235B-A22B, including PP/VPP layout logic.
start_training_dsv3_4layers_proxy.sh New small-scale DeepSeek-V3 script with platform-based docker selection.
start_training_dsv3.sh New/updated DeepSeek-V3 training wrapper with PP/VPP layout handling.
start_training_dsv2_lite.sh New Slurm training wrapper for DeepSeek-V2-Lite.
primus/configs/models/megatron/qwen2.5_base.yaml Adds MoE dispatcher/grouped GEMM defaults to Qwen base config.
primus/backends/megatron/patches/training_log/print_rank_last_patches.py Enhances log output with max ROCm memory usage across ranks.
prepare_c4_data.sh Adds a script to merge/tokenize C4-en shards for Megatron pretraining.
examples/run_slurm_pretrain.sh Switches srun options to use SLURM_* env vars for time/partition/nodelist.
examples/run_pretrain.sh Adds/changes AINIC networking/NCCL tuning exports.
examples/run_local_pretrain.sh Adds a fixed data mount, image pull-if-missing behavior, and container ulimit.
examples/megatron/prepare.py Adds --no-build-isolation to Emerging-Optimizers editable install.
examples/megatron/configs/MI355X/qwen3_30B_A3B-FP8-pretrain.yaml Enables MoE dispatcher + enables TE cross-entropy fusion flags.
examples/megatron/configs/MI355X/qwen3_30B_A3B-BF16-pretrain.yaml Enables MoE dispatcher/grouped GEMM + enables TE cross-entropy fusion flags.
examples/megatron/configs/MI355X/qwen3_235B_A22B-FP8-pretrain.yaml Enables TE cross-entropy fusion flags.
examples/megatron/configs/MI355X/qwen3_235B_A22B-BF16-pretrain.yaml Enables TE cross-entropy fusion flags.
examples/megatron/configs/MI355X/deepseek_v3-FP8-pretrain.yaml Adds Turbo/deepep-related settings + enables TE cross-entropy fusion flags.
examples/megatron/configs/MI355X/deepseek_v3-BF16-pretrain.yaml Changes default PP size + adds Turbo/deepep settings + enables TE cross-entropy fusion flags.
examples/megatron/configs/MI355X/deepseek_v2_lite-FP8-pretrain.yaml Enables TE cross-entropy fusion flags.
examples/megatron/configs/MI355X/deepseek_v2_lite-BF16-pretrain.yaml Enables TE cross-entropy fusion flags.
.github/workflows/ci.yaml Comments out the JAX unit test job.

Comment on lines +7 to +11
if [ "$PLATFORM" = "MI355X" ]; then
export DOCKER_IMAGE="docker.io/tasimage/primus:pr-563-ainic"
elif [ "$PLATFORM" = "B200" ] || [ "$PLATFORM" = "GB200" ]; then
export DOCKER_IMAGE="nvcr.io/nvidia/nemo:25.09"
EXTRA_ARGS="--use_rocm_mem_info_iters None"

Copilot AI Mar 12, 2026


EXTRA_ARGS is built as a single string containing spaces and later passed as "$EXTRA_ARGS", which becomes a single CLI argument (and even an empty argument when unset). This will break flag parsing for run_slurm_pretrain.sh. Use a bash array for optional args (e.g., EXTRA_ARGS=() and append --use_rocm_mem_info_iters and None as separate elements) and expand it with "${EXTRA_ARGS[@]}" (or omit entirely when empty).

Suggested change
if [ "$PLATFORM" = "MI355X" ]; then
export DOCKER_IMAGE="docker.io/tasimage/primus:pr-563-ainic"
elif [ "$PLATFORM" = "B200" ] || [ "$PLATFORM" = "GB200" ]; then
export DOCKER_IMAGE="nvcr.io/nvidia/nemo:25.09"
EXTRA_ARGS="--use_rocm_mem_info_iters None"
EXTRA_ARGS=()
if [ "$PLATFORM" = "MI355X" ]; then
export DOCKER_IMAGE="docker.io/tasimage/primus:pr-563-ainic"
elif [ "$PLATFORM" = "B200" ] || [ "$PLATFORM" = "GB200" ]; then
export DOCKER_IMAGE="nvcr.io/nvidia/nemo:25.09"
EXTRA_ARGS=(--use_rocm_mem_info_iters None)

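To illustrate the reviewer's point (a standalone sketch, not from the PR): a flag string expanded as `"$EXTRA_ARGS"` always becomes exactly one argument, empty or not, while an array expanded as `"${EXTRA_ARGS[@]}"` yields one argument per element and disappears entirely when empty:

```shell
#!/usr/bin/env bash
# Count the positional arguments a command actually receives.
count_args() { echo "$#"; }

EXTRA_STR=""                          # single-string style, unset/empty
EXTRA_ARR=()                          # array style, empty

count_args "$EXTRA_STR"               # prints 1 -- an empty arg is still an arg
count_args "${EXTRA_ARR[@]}"          # prints 0 -- empty array expands to nothing

EXTRA_STR="--use_rocm_mem_info_iters None"
EXTRA_ARR=(--use_rocm_mem_info_iters None)
count_args "$EXTRA_STR"               # prints 1 -- flag and value fused together
count_args "${EXTRA_ARR[@]}"          # prints 2 -- flag and value stay separate
```

The fused single argument is exactly what breaks getopt-style parsing in the downstream script.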
Comment thread examples/run_pretrain.sh
Comment on lines +243 to 261
export NCCL_IB_GID_INDEX=1
# export NCCL_IB_ROCE_VERSION_NUM=2
export NCCL_MAX_P2P_CHANNELS=56
# export NCCL_IB_TC=104
# export NCCL_IB_FIFO_TC=192
export NCCL_IB_TC=41
export NCCL_IB_FIFO_TC=185
export NET_OPTIONAL_RECV_COMPLETION=1
export NCCL_IB_USE_INLINE=1
export RCCL_GDR_FLUSH_GPU_MEM_NO_RELAXED_ORDERING=0
export NCCL_GDR_FLUSH_DISABLE=1
export NCCL_DMABUF_ENABLE=0
export NCCL_IGNORE_CPU_AFFINITY=1
export NCCL_IB_QPS_PER_CONNECTION=1

export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/libibverbs:${RCCL_HOME_DIR}/build/release:${ANP_HOME_DIR}/build:${MPI_HOME_DIR}/lib:$LD_LIBRARY_PATH

else
export NCCL_IB_GID_INDEX=3

Copilot AI Mar 12, 2026


This block re-exports several NCCL/RCCL tuning env vars (including NCCL_IB_GID_INDEX, NCCL_MAX_P2P_CHANNELS, NCCL_IB_TC, NCCL_IB_FIFO_TC) after they were already computed/logged earlier in the same USING_AINIC branch. In particular, forcing NCCL_IB_TC=41/NCCL_IB_FIFO_TC=185 overrides the earlier detection/default logic and makes the earlier logs misleading. Consider removing this duplicate section, or only setting values when not already set so the detection path remains effective.

Suggested change
export NCCL_IB_GID_INDEX=1
# export NCCL_IB_ROCE_VERSION_NUM=2
export NCCL_MAX_P2P_CHANNELS=56
# export NCCL_IB_TC=104
# export NCCL_IB_FIFO_TC=192
export NCCL_IB_TC=41
export NCCL_IB_FIFO_TC=185
export NET_OPTIONAL_RECV_COMPLETION=1
export NCCL_IB_USE_INLINE=1
export RCCL_GDR_FLUSH_GPU_MEM_NO_RELAXED_ORDERING=0
export NCCL_GDR_FLUSH_DISABLE=1
export NCCL_DMABUF_ENABLE=0
export NCCL_IGNORE_CPU_AFFINITY=1
export NCCL_IB_QPS_PER_CONNECTION=1
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/libibverbs:${RCCL_HOME_DIR}/build/release:${ANP_HOME_DIR}/build:${MPI_HOME_DIR}/lib:$LD_LIBRARY_PATH
else
export NCCL_IB_GID_INDEX=3
[ -z "${NCCL_IB_GID_INDEX}" ] && export NCCL_IB_GID_INDEX=1
# export NCCL_IB_ROCE_VERSION_NUM=2
[ -z "${NCCL_MAX_P2P_CHANNELS}" ] && export NCCL_MAX_P2P_CHANNELS=56
# export NCCL_IB_TC=104
# export NCCL_IB_FIFO_TC=192
[ -z "${NCCL_IB_TC}" ] && export NCCL_IB_TC=41
[ -z "${NCCL_IB_FIFO_TC}" ] && export NCCL_IB_FIFO_TC=185
[ -z "${NET_OPTIONAL_RECV_COMPLETION}" ] && export NET_OPTIONAL_RECV_COMPLETION=1
[ -z "${NCCL_IB_USE_INLINE}" ] && export NCCL_IB_USE_INLINE=1
[ -z "${RCCL_GDR_FLUSH_GPU_MEM_NO_RELAXED_ORDERING}" ] && export RCCL_GDR_FLUSH_GPU_MEM_NO_RELAXED_ORDERING=0
[ -z "${NCCL_GDR_FLUSH_DISABLE}" ] && export NCCL_GDR_FLUSH_DISABLE=1
[ -z "${NCCL_DMABUF_ENABLE}" ] && export NCCL_DMABUF_ENABLE=0
[ -z "${NCCL_IGNORE_CPU_AFFINITY}" ] && export NCCL_IGNORE_CPU_AFFINITY=1
[ -z "${NCCL_IB_QPS_PER_CONNECTION}" ] && export NCCL_IB_QPS_PER_CONNECTION=1
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/libibverbs:${RCCL_HOME_DIR}/build/release:${ANP_HOME_DIR}/build:${MPI_HOME_DIR}/lib:$LD_LIBRARY_PATH
else
[ -z "${NCCL_IB_GID_INDEX}" ] && export NCCL_IB_GID_INDEX=3

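A side note on the `[ -z ... ] && export` guards in the suggestion above: the same set-only-if-unset behavior can be written more compactly with parameter expansion (a generic sketch; the variable names are just the ones under discussion):

```shell
#!/usr/bin/env bash
# Keep any value the caller already exported; otherwise fall back to a default.
export NCCL_IB_TC="${NCCL_IB_TC:-41}"
export NCCL_IB_FIFO_TC="${NCCL_IB_FIFO_TC:-185}"

echo "NCCL_IB_TC=${NCCL_IB_TC} NCCL_IB_FIFO_TC=${NCCL_IB_FIFO_TC}"
# Note: ${VAR:-default} also replaces an explicitly *empty* value;
# use ${VAR-default} if an empty string should be honored as a deliberate choice.
```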
Comment thread prepare_c4_data.sh
Comment on lines +65 to +91
JSONL_FILE="${JSONL_DIR}/c4_en_train.jsonl"

if [ -f "${JSONL_FILE}" ]; then
echo "JSONL file already exists: ${JSONL_FILE}"
echo "Skipping merge. Delete it to re-merge."
else
# Verify shards exist
MISSING=0
for i in $(seq 0 $((NUM_SHARDS - 1))); do
SHARD_NAME=$(printf "c4-train.%05d-of-01024.json.gz" "$i")
if [ ! -f "${RAW_DIR}/${SHARD_NAME}" ]; then
echo " WARNING: Missing shard ${SHARD_NAME}"
MISSING=$((MISSING + 1))
fi
done
if [ "$MISSING" -gt 0 ]; then
echo "ERROR: ${MISSING} shard(s) missing in ${RAW_DIR}. Cannot proceed."
exit 1
fi

echo "Decompressing and merging shards into JSONL ..."
for i in $(seq 0 $((NUM_SHARDS - 1))); do
SHARD_NAME=$(printf "c4-train.%05d-of-01024.json.gz" "$i")
SHARD_PATH="${RAW_DIR}/${SHARD_NAME}"
echo " [${i}/${NUM_SHARDS}] Decompressing ${SHARD_NAME} ..."
zcat "${SHARD_PATH}" >> "${JSONL_FILE}"
done

Copilot AI Mar 12, 2026


The JSONL merge step appends shards directly into the final c4_en_train.jsonl. If the script is interrupted mid-merge, the partially written JSONL file will exist and subsequent runs will skip merging, leaving a corrupt/incomplete dataset. Write to a temporary file and mv atomically on success (or truncate the target before the loop and add a checksum/size check) to make reruns safe.

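The temp-file-plus-rename pattern the reviewer suggests can be sketched as follows (illustrative, not the PR's code; `merge_shards` stands in for the real zcat-over-shards loop):

```shell
#!/usr/bin/env bash
set -eu

JSONL_FILE="c4_en_train.jsonl"
TMP_FILE="${JSONL_FILE}.tmp"          # same filesystem, so mv is an atomic rename

merge_shards() {                      # stand-in for the real zcat-over-shards loop
  printf '{"text":"shard %d"}\n' 0 1 2
}

if [ -f "${JSONL_FILE}" ]; then
  echo "JSONL file already exists: ${JSONL_FILE}"
else
  merge_shards > "${TMP_FILE}"        # an interrupted run leaves only the temp file
  mv "${TMP_FILE}" "${JSONL_FILE}"    # rename only after the merge fully succeeded
fi
```

Because the rename happens only on success, a rerun after an interruption finds no `c4_en_train.jsonl` and safely redoes the merge instead of skipping it.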
Comment on lines +43 to 44
pipeline_model_parallel_size: ${PRIMUS_PP:8}
expert_model_parallel_size: ${PRIMUS_EP:8}

Copilot AI Mar 12, 2026


Changing the default pipeline_model_parallel_size from ${PRIMUS_PP:1} to ${PRIMUS_PP:8} makes the config default to PP=8 when PRIMUS_PP isn’t set, which will break/behave unexpectedly for single-node or small runs. Consider keeping the default at 1 (and letting training scripts set PRIMUS_PP explicitly), or document very clearly that this config now requires PRIMUS_PP to be set appropriately.

Suggested change
pipeline_model_parallel_size: ${PRIMUS_PP:8}
expert_model_parallel_size: ${PRIMUS_EP:8}
pipeline_model_parallel_size: ${PRIMUS_PP:1}
expert_model_parallel_size: ${PRIMUS_EP:1}

Comment on lines +46 to +48
${SLURM_TIME:+--time="${SLURM_TIME}"} \
${SLURM_NODELIST:+--nodelist="${SLURM_NODELIST}"} \
${SLURM_PARTITION:+--partition="${SLURM_PARTITION}"} \

Copilot AI Mar 12, 2026


Switching to SLURM_NODELIST/SLURM_TIME/SLURM_PARTITION as user-provided overrides is confusing because SLURM_NODELIST is also a Slurm-populated env var inside allocations. That makes it hard to distinguish “allocated nodes” vs “requested override nodes”, and can accidentally force --nodelist when running inside Slurm. Consider keeping the previous NODE_LIST override (or renaming to something like PRIMUS_NODELIST_OVERRIDE) and updating the script to prefer that, falling back to Slurm’s own env vars when present.

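One way to realize the reviewer's suggestion (a sketch only; `PRIMUS_NODELIST_OVERRIDE` is the hypothetical variable name the comment proposes):

```shell
#!/usr/bin/env bash
# Prefer an explicit user override, then Slurm's own allocation variable,
# and pass no --nodelist at all when neither is set.
NODELIST_ARGS=()
if [ -n "${PRIMUS_NODELIST_OVERRIDE:-}" ]; then
  NODELIST_ARGS=(--nodelist="${PRIMUS_NODELIST_OVERRIDE}")
elif [ -n "${SLURM_NODELIST:-}" ]; then
  NODELIST_ARGS=(--nodelist="${SLURM_NODELIST}")
fi

echo "srun extra args: ${NODELIST_ARGS[*]:-(none)}"
# srun ... "${NODELIST_ARGS[@]}" ...
```

Using an array here also avoids the empty-argument problem: when neither variable is set, `"${NODELIST_ARGS[@]}"` expands to nothing and srun sees no `--nodelist` flag at all.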
Comment on lines +218 to +220
f" | rank-{max_rank} rocm max mem usage/usage_ratio: "
f"{max_r_used / 1024 ** 3:.2f}GB/"
f"{max_r_used / r_total * 100:.2f}%"

Copilot AI Mar 12, 2026


The reported rank-{max_rank} rocm max mem ... usage_ratio uses max_r_used / r_total, where r_total is from the local rank, not necessarily max_rank. If total memory differs across GPUs/nodes, this ratio can be incorrect. Consider gathering r_total as well (or computing the ratio on the max-rank and broadcasting it) so the max-rank ratio is self-consistent.

ARGS=("$@")

VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH")
VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH" -v "/shared_aig/c4:/shared_aig/c4")

Copilot AI Mar 12, 2026


Adding a hard-coded bind mount for /shared_aig/c4 makes run_local_pretrain.sh environment-specific and can fail (or create unexpected host directories) on machines that don’t have that path. Consider making this conditional (only mount if the directory exists) or driven by an env var (e.g., EXTRA_VOLUME_ARGS / PRIMUS_DATA_MOUNTS) so the script remains portable.

Suggested change
VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH" -v "/shared_aig/c4:/shared_aig/c4")
VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH")
if [ -d "/shared_aig/c4" ]; then
VOLUME_ARGS+=(-v "/shared_aig/c4:/shared_aig/c4")
fi
if [[ -n "${EXTRA_VOLUME_ARGS:-}" ]]; then
# Allow callers to inject additional volume args, e.g. EXTRA_VOLUME_ARGS='-v /host/path:/container/path'
VOLUME_ARGS+=(${EXTRA_VOLUME_ARGS})
fi

Comment thread .github/workflows/ci.yaml
Comment on lines +308 to +332
# run-unittest-jax:
# env:
# PRIMUS_WORKDIR: /wekafs/primus-data/primus_safe_ci/jax
# needs: [code-lint]
# runs-on: [primus-lm-cicd-jax-8t8mh]
# steps:
# - run: echo "🎉 Begin Primus-Turbo Checkout."
# - name: Set commit hash to env
# run: echo "PRIMUS_TURBO_COMMIT=${PRIMUS_TURBO_COMMIT}" >> $GITHUB_ENV
# - name: Checkout Repo Primus-Turbo
# uses: actions/checkout@v4
# with:
# repository: AMD-AIG-AIMA/Primus-Turbo
# submodules: "recursive"
# path: Primus-Turbo
# ref: ${{ env.PRIMUS_TURBO_COMMIT }}
# - run: echo "Begin Primus-Turbo Install."
# - name: Install Primus-Turbo
# run: |
# mv Primus-Turbo /tmp/
# echo "Primus-Turbo dir: /tmp/Primus-Turbo"
# git config --global --add safe.directory /tmp/Primus-Turbo
# cd /tmp/Primus-Turbo
# start_time=$(date +%s)
# echo "✅ [Pip install requirements] started at: $(date)"

Copilot AI Mar 12, 2026


This PR comments out the entire run-unittest-jax job, and there doesn’t appear to be any other workflow job running the --jax unit tests after this change. If JAX support is still maintained, this will remove CI coverage and allow regressions to land. Consider fixing the runner selection / installation issue and keeping the job enabled, or replacing it with an equivalent JAX test job.

Suggested change
# run-unittest-jax:
# env:
# PRIMUS_WORKDIR: /wekafs/primus-data/primus_safe_ci/jax
# needs: [code-lint]
# runs-on: [primus-lm-cicd-jax-8t8mh]
# steps:
# - run: echo "🎉 Begin Primus-Turbo Checkout."
# - name: Set commit hash to env
# run: echo "PRIMUS_TURBO_COMMIT=${PRIMUS_TURBO_COMMIT}" >> $GITHUB_ENV
# - name: Checkout Repo Primus-Turbo
# uses: actions/checkout@v4
# with:
# repository: AMD-AIG-AIMA/Primus-Turbo
# submodules: "recursive"
# path: Primus-Turbo
# ref: ${{ env.PRIMUS_TURBO_COMMIT }}
# - run: echo "Begin Primus-Turbo Install."
# - name: Install Primus-Turbo
# run: |
# mv Primus-Turbo /tmp/
# echo "Primus-Turbo dir: /tmp/Primus-Turbo"
# git config --global --add safe.directory /tmp/Primus-Turbo
# cd /tmp/Primus-Turbo
# start_time=$(date +%s)
# echo "✅ [Pip install requirements] started at: $(date)"
run-unittest-jax:
env:
PRIMUS_WORKDIR: /wekafs/primus-data/primus_safe_ci/jax
needs: [code-lint]
runs-on: [primus-lm-cicd-jax-8t8mh]
steps:
- name: Begin JAX unit tests
run: echo "🎉 Begin JAX unit tests in ${PRIMUS_WORKDIR}"
- name: Set commit hash to env
run: echo "PRIMUS_TURBO_COMMIT=${PRIMUS_TURBO_COMMIT}" >> $GITHUB_ENV
- name: Checkout repository
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.10"
- name: Install dependencies (including JAX extras)
run: |
python -m pip install --upgrade pip
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
# Attempt to install JAX-related extras if defined; ignore failures
pip install -e ".[jax,test]" || pip install -e ".[jax]" || true
- name: Run JAX unit tests
run: |
pytest --jax

Comment on lines +74 to +75
--profile_step_end 7 \
--profile_step_start 6 \

Copilot AI Mar 12, 2026


TRAIN_ITERS is set to 5, but --profile_step_start 6 / --profile_step_end 7 are outside the iteration range. If profiling is enabled (PROFILE=True), this likely results in no profiling data being captured. Consider tying the profile step range to TRAIN_ITERS (or lowering these defaults) so the script works when toggling profiling on.

Suggested change
--profile_step_end 7 \
--profile_step_start 6 \
--profile_step_end $TRAIN_ITERS \
--profile_step_start 1 \

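Tying the profile window to the iteration count, as suggested, could look like this (illustrative defaults, not the PR's code):

```shell
#!/usr/bin/env bash
TRAIN_ITERS="${TRAIN_ITERS:-5}"

# Profile the last two iterations, clamped to stay inside [1, TRAIN_ITERS].
PROFILE_STEP_END="${TRAIN_ITERS}"
PROFILE_STEP_START=$(( TRAIN_ITERS > 1 ? TRAIN_ITERS - 1 : 1 ))

echo "--profile_step_start ${PROFILE_STEP_START} --profile_step_end ${PROFILE_STEP_END}"
```

With `TRAIN_ITERS=5` this yields steps 4 and 5, so toggling `PROFILE=True` actually captures data instead of referencing iterations that never run.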
Comment on lines +132 to +135
--profile $PROFILE \
--use_pytorch_profiler $PROFILE \
--profile_step_end 7 \
--profile_step_start 6 \

Copilot AI Mar 12, 2026


TRAIN_ITERS is set to 5, but --profile_step_start 6 / --profile_step_end 7 are outside the iteration range. If profiling is enabled (PROFILE=True), this likely results in no profiling data being captured. Consider tying the profile step range to TRAIN_ITERS (or lowering these defaults) so the script works when toggling profiling on.

Copilot AI review requested due to automatic review settings March 13, 2026 08:54
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 16 comments.

Comment thread examples/run_pretrain.sh
Comment on lines +242 to +249
# unset NCCL_IB_GID_INDEX
export NCCL_IB_GID_INDEX=1
# export NCCL_IB_ROCE_VERSION_NUM=2
export NCCL_MAX_P2P_CHANNELS=56
# export NCCL_IB_TC=104
# export NCCL_IB_FIFO_TC=192
export NCCL_IB_TC=41
export NCCL_IB_FIFO_TC=185
Comment thread start_training_dsv3.sh Outdated
Comment on lines +3 to +4
export HF_TOKEN="your_hf_token" # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key
Comment thread start_training_dsv3_4layers_proxy.sh Outdated
Comment on lines +3 to +4
export HF_TOKEN="your_hf_token" # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key
Comment thread prepare_c4_data.sh
Comment on lines +22 to +25
# ======================== Configuration ========================
NUM_SHARDS=${NUM_SHARDS:-200} # Number of C4 shards to download (1-1024)
DATA_DIR=${DATA_DIR:-"/shared/c4"}
PRIMUS_PATH=${PRIMUS_PATH:-"/shared/john/Primus"}
Comment on lines +3 to +4
export HF_TOKEN="your_hf_token" # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key
log_info(f"Building Emerging Optimizers in {emerging_optimizers_path}")
ret = subprocess.run(["pip", "install", "-e", str(emerging_optimizers_path)], check=True)
ret = subprocess.run(
["pip", "install", "--no-build-isolation", "-e", str(emerging_optimizers_path)], check=True
Comment thread .github/workflows/ci.yaml
Comment on lines +308 to +332
# run-unittest-jax:
# env:
# PRIMUS_WORKDIR: /wekafs/primus-data/primus_safe_ci/jax
# needs: [code-lint]
# runs-on: [primus-lm-cicd-jax-8t8mh]
# steps:
# - run: echo "🎉 Begin Primus-Turbo Checkout."
# - name: Set commit hash to env
# run: echo "PRIMUS_TURBO_COMMIT=${PRIMUS_TURBO_COMMIT}" >> $GITHUB_ENV
# - name: Checkout Repo Primus-Turbo
# uses: actions/checkout@v4
# with:
# repository: AMD-AIG-AIMA/Primus-Turbo
# submodules: "recursive"
# path: Primus-Turbo
# ref: ${{ env.PRIMUS_TURBO_COMMIT }}
# - run: echo "Begin Primus-Turbo Install."
# - name: Install Primus-Turbo
# run: |
# mv Primus-Turbo /tmp/
# echo "Primus-Turbo dir: /tmp/Primus-Turbo"
# git config --global --add safe.directory /tmp/Primus-Turbo
# cd /tmp/Primus-Turbo
# start_time=$(date +%s)
# echo "✅ [Pip install requirements] started at: $(date)"
Comment thread start_training_dsv3.sh
Comment on lines +9 to +11
export SLURM_NODELIST="uswslocpm2m-106-[030-031,038-039,050,063,069,225,942,1531-1532,1536,1547,1549,1554,1556-1557,1561,1579,1583,1585,1588,1592,1596,1606,1627-1629,1650,1659-1660,1678]"
export SLURM_NODELIST="uswslocpm2m-106-[030-031,038-039,050,063,069,225,942,1531-1532,1536,1547,1549,1554,1556]" # -1557,1561,1579,1583,1585,1588,1592,1596,1606,1627-1629,1650,1659-1660,1678]"
export SLURM_NODELIST="uswslocpm2m-106-[030-031,038-039,050,063,069,225,942,1531-1532,1536,-1557,1561,1579,1583]" #1585,1588,1592,1596,1606,1627-1629,1650,1659-1660,1678]"
Comment on lines +46 to +48
${SLURM_TIME:+--time="${SLURM_TIME}"} \
${SLURM_NODELIST:+--nodelist="${SLURM_NODELIST}"} \
${SLURM_PARTITION:+--partition="${SLURM_PARTITION}"} \
Comment on lines +26 to +28
moe_token_dispatcher_type: alltoall

moe_grouped_gemm: true
Copilot AI review requested due to automatic review settings March 16, 2026 01:57
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 13 comments.

Comments suppressed due to low confidence (1)

primus/configs/models/megatron/qwen2.5_base.yaml:34

  • This YAML now defines moe_token_dispatcher_type and moe_grouped_gemm twice (once around lines 26–28 and again around lines 32–35). Duplicate keys in YAML are ambiguous (parsers typically take the last value) and can hide configuration mistakes. Please remove the redundant earlier entries or consolidate them into a single block.
moe_token_dispatcher_type: alltoall

moe_grouped_gemm: true
apply_rope_fusion: true
masked_softmax_fusion: false

moe_token_dispatcher_type: alltoall
moe_grouped_gemm: true
moe_use_legacy_grouped_gemm: false

Comment thread start_training_dsv3_4layers_proxy.sh Outdated
Comment on lines +91 to +94
--disable_wandb True \
--disable_tensorboard True \
"$EXTRA_ARGS" \
2>&1 | tee output/$PRIMUS_TEAM/$PRIMUS_USER/$PRIMUS_EXP_NAME/log.txt
Comment on lines +3 to +4
export HF_TOKEN="your_hf_token" # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key
Comment on lines 42 to +48
srun -N "${NNODES}" \
--exclusive \
--export ALL \
${NODE_LIST:+--nodelist="${NODE_LIST}"} \
--ntasks-per-node=1 \
${SLURM_TIME:+--time="${SLURM_TIME}"} \
${SLURM_NODELIST:+--nodelist="${SLURM_NODELIST}"} \
${SLURM_PARTITION:+--partition="${SLURM_PARTITION}"} \
Comment thread start_training_qwen_30B_a3B.sh Outdated
Comment on lines +3 to +4
export HF_TOKEN="your_hf_token" # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key
Comment thread start_training_dsv3.sh Outdated
Comment on lines +3 to +5
export HF_TOKEN="your_hf_token" # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key

Comment thread .github/workflows/ci.yaml
Comment on lines +308 to +313
# run-unittest-jax:
# env:
# PRIMUS_WORKDIR: /wekafs/primus-data/primus_safe_ci/jax
# needs: [code-lint]
# runs-on: [primus-lm-cicd-jax-8t8mh]
# steps:
Comment on lines +3 to +4
export HF_TOKEN="your_hf_token" # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key
Comment thread start_training_dsv3.sh
Comment on lines +9 to +11
export SLURM_NODELIST="uswslocpm2m-106-[030-031,038-039,050,063,069,225,942,1531-1532,1536,1547,1549,1554,1556-1557,1561,1579,1583,1585,1588,1592,1596,1606,1627-1629,1650,1659-1660,1678]"
export SLURM_NODELIST="uswslocpm2m-106-[030-031,038-039,050,063,069,225,942,1531-1532,1536,1547,1549,1554,1556]" # -1557,1561,1579,1583,1585,1588,1592,1596,1606,1627-1629,1650,1659-1660,1678]"
export SLURM_NODELIST="uswslocpm2m-106-[030-031,038-039,050,063,069,225,942,1531-1532,1536,-1557,1561,1579,1583]" #1585,1588,1592,1596,1606,1627-1629,1650,1659-1660,1678]"
Comment thread examples/run_pretrain.sh
Comment on lines +248 to +249
export NCCL_IB_TC=41
export NCCL_IB_FIFO_TC=185
Comment thread start_training_dsv3_4layers_proxy.sh Outdated
Comment on lines +3 to +5
export HF_TOKEN="your_hf_token" # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key

Copilot AI review requested due to automatic review settings March 16, 2026 02:43
wenxie-amd and others added 2 commits March 24, 2026 02:31
Remove the --debug gate so users always see the complete
stack trace alongside the one-line summary, reducing
back-and-forth when diagnosing failures.

Made-with: Cursor
Copilot AI review requested due to automatic review settings March 24, 2026 04:48
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 159 out of 159 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (13)

runner/primus-cli-container.sh:1

  • This now hard-fails when neither docker nor podman is present, even in dry-run mode. The previous logic explicitly allowed a mock runtime during dry-runs; without that, --dry-run becomes unusable on hosts lacking container tooling. Consider restoring the dry-run fallback (mock runtime) or skipping runtime validation when DRY_RUN_MODE==true.
    primus/modules/trainer/megatron/trainer.py:1
  • The new defaulting logic does not handle the common case where args.moe_layer_freq exists but is None. In that case, the value remains None and downstream code expecting an int/list can break. Suggest setting the default when getattr(args, "moe_layer_freq", None) is None (not only when the attribute is missing).
    runner/helpers/envs/base_env.sh:1
  • The comments and defaults are now inconsistent: HSA_NO_SCRATCH_RECLAIM is defaulting to 1, but the comment claims setting it to 0 prevents core dumps; similarly the TE comment says to disable v3 due to accuracy issues, but the default is now 1 (enabled). Please update the comments to match the new defaults/behavior, or revert the defaults to align with the documented guidance.
    start_training_dsv3.sh:1
  • This script exports SLURM_NODELIST four times (only the last value takes effect), and one intermediate value contains -1557 inside the bracket expression, which looks like an invalid node/range token. Please remove the redundant exports and keep a single valid SLURM_NODELIST (or make the selection conditional via an argument/env var).
    start_training_dsv3_4layers_proxy.sh:1
  • The first PRIMUS_EXP_NAME assignment is immediately overwritten, so it will never be used. This makes it hard to tell which experiment naming convention is intended. Consider making the debug name conditional (e.g., gated on an env var) or removing the unused assignment.
    primus/backends/megatron/patches/training_log/print_rank_last_patches.py:1
  • Using all_gather allocates O(world_size) tensors and moves a list of values to every rank on each sample, which can add noticeable overhead at large scale. If you only need the maximum, consider an all_reduce(MAX) on r_used_tensor. If you also want the rank of the max, consider reducing a (value, rank) pair via a custom reduction or gather only on rank0.
    primus/backends/megatron/patches/pp_warmup_patches.py:1
  • This warmup implementation assumes a very specific model structure (model iterable of chunks, then model_chunk.module.module.decoder.layers) and a specific per-layer forward signature/return shape (dummy_output, _ = layer.forward(...)). This is likely to break for non-GPT models (e.g., Mamba), encoder-decoder variants, or Megatron version differences, causing pre-train crashes when pp_warmup is enabled. Recommend adding robust feature-detection (validate attribute chain + callable signature) and falling back to a safer warmup (e.g., a single dummy forward/backward through the normal forward_step_func), or gating the patch to known-compatible backend/model_types.
    runner/helpers/hooks/03_enable_ainic.sh:1
  • These settings are now hard-coded and will override any values a user may have set in the environment. Previously the script used ${VAR:-default} patterns, which allowed overrides for cluster-specific tuning. Consider restoring the override-friendly defaults (e.g., NCCL_IB_TC=${NCCL_IB_TC:-41} etc.) so operators can tune without editing the hook.
    primus/backends/maxtext/patches/max_utils_patches.py:1
  • This monkey-patches jax.distributed.initialize globally and does not restore it. In long-lived processes (or if multiple backends/modules are initialized in one process), this can leak behavior across unrelated code and make debugging version-specific JAX issues harder. Consider restoring the original function after MaxText initialization (e.g., patch only within the initialization call scope) or attaching the wrapper in a way that can be cleanly undone during teardown.
    primus/cli/main.py:1
  • The exception handler now always prints the full traceback via traceback.print_exc(), then also prints a summarized error line. This changes the CLI behavior from the previous --debug-gated verbosity and can significantly increase log volume (and potentially expose internal paths) in production runs. Consider restoring the --debug gating (or an env flag) for full tracebacks, while keeping the concise error+location output for non-debug runs.

wenxie-amd and others added 2 commits March 24, 2026 06:12
add opt config

---------

Co-authored-by: HuangWei-95 <weihuan@amd.com>
Copilot AI review requested due to automatic review settings March 25, 2026 00:53

Copilot AI left a comment


Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

@HuangWei-95 HuangWei-95 requested a review from Copilot March 25, 2026 00:53

lhzhang333 and others added 2 commits March 25, 2026 15:53
Co-authored-by: HuangWei-95 <weihuan@amd.com>
Copilot AI review requested due to automatic review settings March 26, 2026 02:15
Add qwen3_30B configuration modifications (dated March 24).

Copilot AI left a comment


Pull request overview

Copilot reviewed 162 out of 162 changed files in this pull request and generated 6 comments.

Comment thread start_training_dsv3.sh
Comment on lines +9 to +12
export SLURM_NODELIST="uswslocpm2m-106-[030-031,038-039,050,063,069,225,942,1531-1532,1536,1547,1549,1554,1556-1557,1561,1579,1583,1585,1588,1592,1596,1606,1627-1629,1650,1659-1660,1678]"
export SLURM_NODELIST="uswslocpm2m-106-[030-031,038-039,050,063,069,225,942,1531-1532,1536,1547,1549,1554,1556]" # -1557,1561,1579,1583,1585,1588,1592,1596,1606,1627-1629,1650,1659-1660,1678]"
export SLURM_NODELIST="uswslocpm2m-106-[030-031,038-039,050,063,069,225,942,1531-1532,1536,-1557,1561,1579,1583]" #1585,1588,1592,1596,1606,1627-1629,1650,1659-1660,1678]"
export SLURM_NODELIST="uswslocpm2m-106-[030-031,038-039,050,063,069,225,942,1531-1532,1536,1547,1549,1554,1556-1557,1561,1579,1583,1585,1588,1592,1596,1606,1627-1629,1683-1684,1691,1697]"

Copilot AI Mar 26, 2026


This script exports SLURM_NODELIST four times in a row; the first three assignments are dead code and one contains a malformed token (...,1536,-1557,1561,...). Keeping only a single SLURM_NODELIST (or making it an input override) will avoid confusion and accidental invalid nodelists.
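A minimal sketch of the suggested fix, collapsing the stacked exports into a single override-friendly assignment (the default nodelist here is a shortened, illustrative value):

```shell
# Operators can override SLURM_NODELIST from the environment; otherwise the
# single, well-formed default below applies.
export SLURM_NODELIST="${SLURM_NODELIST:-uswslocpm2m-106-[030-031,038-039,050]}"
```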

Comment thread start_training_dsv3.sh
Comment on lines +52 to +60
8)
FEATURE_ARGS+=("--pipeline_model_parallel_layout" "'Et*7|t*8|t*8|t*8|t*8|t*8|t*7|t*7,L'")
;;
16)
FEATURE_ARGS+=("--pipeline_model_parallel_layout" "'Et*3|t*4|t*4|t*4|t*4|t*4|t*4|t*4|t*4|t*4|t*4|t*4|t*4|t*4|t*3|t*3,L'")
;;
32)
FEATURE_ARGS+=("--pipeline_model_parallel_layout" "'Et*1|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*1|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*1,L'")
;;

Copilot AI Mar 26, 2026


FEATURE_ARGS values include embedded single quotes (e.g., "'Et*7|...|...,L'"). Because the value is already quoted by Bash, those inner quotes will be passed literally to the CLI and may break pipeline_model_parallel_layout parsing. Pass the layout string without the extra '...' wrapper.
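The quoting issue is easy to reproduce; the short layout string used for the broken case below is illustrative, while the clean case reuses the PP=8 layout from the script:

```shell
# Inner single quotes inside a double-quoted word are passed literally.
FEATURE_ARGS=(--pipeline_model_parallel_layout "'Et*7,L'")
BROKEN="${FEATURE_ARGS[1]}"   # value now includes the quote characters

# Passing the layout as a plain word hands the parser the bare string.
FEATURE_ARGS=(--pipeline_model_parallel_layout "Et*7|t*8|t*8|t*8|t*8|t*8|t*7|t*7,L")
CLEAN="${FEATURE_ARGS[1]}"
```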

Comment on lines 42 to +48
srun -N "${NNODES}" \
--exclusive \
--export ALL \
${NODE_LIST:+--nodelist="${NODE_LIST}"} \
--ntasks-per-node=1 \
${SLURM_TIME:+--time="${SLURM_TIME}"} \
${SLURM_NODELIST:+--nodelist="${SLURM_NODELIST}"} \
${SLURM_PARTITION:+--partition="${SLURM_PARTITION}"} \

Copilot AI Mar 26, 2026


This srun invocation now uses SLURM_TIME, SLURM_NODELIST, and SLURM_PARTITION, but the script’s --help section (not shown in this hunk) still documents NODE_LIST. Please update the help text to match the new env var names/options to avoid user confusion.

Comment on lines +85 to +88
args = get_megatron_args()
if getattr(args, "pp_warmup", True):
run_pp_warmup(model, config, args, optimizer, get_timers())
return original_train(

Copilot AI Mar 26, 2026


getattr(args, "pp_warmup", True) defaults to True, so PP warmup will run even when the flag is absent. This is risky because warmup does extra forward/backward work and may break non-GPT models. Default this to False (or require the arg to be explicitly set), and add a one-time guard (e.g., closure flag) so warmup truly runs only once as stated in the docstring.

Comment on lines 220 to 228
# Validate container runtime (docker/podman)
# In dry-run mode, allow mock runtime if docker/podman not available
if [[ "$DRY_RUN_MODE" == "true" ]]; then
if command -v podman >/dev/null 2>&1; then
export CONTAINER_RUNTIME="podman"
elif command -v docker >/dev/null 2>&1; then
export CONTAINER_RUNTIME="docker"
else
# Mock runtime for dry-run testing
export CONTAINER_RUNTIME="docker"
LOG_INFO_RANK0 "[container] Using mock container runtime for dry-run (no docker/podman found)"
fi
if command -v docker >/dev/null 2>&1; then
export CONTAINER_RUNTIME="docker"
elif command -v podman >/dev/null 2>&1; then
export CONTAINER_RUNTIME="podman"
else
validate_container_runtime
LOG_ERROR "[container] Neither docker nor podman is available in PATH."
exit 1
fi

Copilot AI Mar 26, 2026


This container-runtime validation now hard-fails when neither docker nor podman is on PATH, even in --dry-run mode. That defeats dry-run usage on login/CI hosts without container tooling. Consider keeping the previous behavior: in dry-run allow a mock/default runtime (or skip runtime validation entirely) while still validating in real runs.
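One way to keep dry runs usable while still failing hard on real runs, sketched under the assumption that `DRY_RUN_MODE` and `CONTAINER_RUNTIME` behave as in the diff above:

```shell
DRY_RUN_MODE="${DRY_RUN_MODE:-true}"
if command -v docker >/dev/null 2>&1; then
  CONTAINER_RUNTIME="docker"
elif command -v podman >/dev/null 2>&1; then
  CONTAINER_RUNTIME="podman"
elif [ "$DRY_RUN_MODE" = "true" ]; then
  CONTAINER_RUNTIME="docker"  # mock runtime: dry-run only prints commands
else
  echo "[container] Neither docker nor podman is available in PATH." >&2
  exit 1
fi
export CONTAINER_RUNTIME
```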

Comment on lines 222 to +223
# Note: Disable v3 due to accuracy issues. Will fix after TE version 2.1.
export NVTE_CK_USES_BWD_V3=${NVTE_CK_USES_BWD_V3:-0}
export NVTE_CK_USES_BWD_V3=${NVTE_CK_USES_BWD_V3:-1}

Copilot AI Mar 26, 2026


The comment says to disable v3 due to accuracy issues, but the default is now set to NVTE_CK_USES_BWD_V3:-1 (enabled). Either update the comment to match the new default rationale, or keep the default disabled (:-0) and enable only via explicit opt-in.

Copilot AI review requested due to automatic review settings March 29, 2026 07:54

Copilot AI left a comment


Pull request overview

Copilot reviewed 165 out of 165 changed files in this pull request and generated 9 comments.

Comment on lines +201 to +211
# When pipeline parallelism (PP) is enabled, memory usage can vary across ranks.
# Therefore, we report the maximum ROCm memory usage across all ranks.
r_used_tensor = torch.tensor([r_used], device="cuda", dtype=torch.int64)
world_size = torch.distributed.get_world_size()
gathered_r_used = [torch.zeros_like(r_used_tensor) for _ in range(world_size)]
torch.distributed.all_gather(gathered_r_used, r_used_tensor)

total_r_used = [t.item() for t in gathered_r_used]
log_rank_0(f"total_r_used: {[round(r_used / 1024 ** 3, 2) for r_used in total_r_used]}")
max_r_used = max(total_r_used)
max_rank = total_r_used.index(max_r_used)

Copilot AI Mar 29, 2026


Collecting per-rank ROCm memory via all_gather and logging the full total_r_used list can add noticeable overhead and produce very large logs at scale. Consider using a cheaper collective (e.g., reduce/max) and only logging the aggregate (or gate the per-rank list behind a debug flag).

Comment on lines +37 to +39
raise ValueError(
f'When using recompute_layer_ids, recompute_granuarlity: {args.recompute_granularity} must be "full"'
)

Copilot AI Mar 29, 2026


Typo in the error message: recompute_granuarlity should be recompute_granularity (and matching spelling makes grep/debugging easier).

Comment on lines +3 to +6
set -x

export HF_TOKEN="${HF_TOKEN:-'your_hf_token'}" # make it your own hf token
export WANDB_API_KEY="${WANDB_API_KEY:-'your_wandb_api_key'}" # make it your own wandb api key

Copilot AI Mar 29, 2026


set -x will echo all commands, including the HF/W&B token exports. This can leak credentials into logs; disable xtrace by default or only enable it after sensitive env vars are set (or guard it behind a DEBUG flag).
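A sketch of the suggested guard; `DEBUG` is a hypothetical opt-in flag, and the final `case` merely probes whether xtrace ended up enabled:

```shell
HF_TOKEN="${HF_TOKEN:-placeholder}"  # export secrets while tracing is still off
export HF_TOKEN

if [ "${DEBUG:-0}" = "1" ]; then set -x; fi  # opt-in tracing only

case $- in                      # record whether xtrace is active
  *x*) TRACING=on ;;
  *)   TRACING=off ;;
esac
```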

Comment on lines +5 to +6
export HF_TOKEN="${HF_TOKEN:-'your_hf_token'}" # make it your own hf token
export WANDB_API_KEY="${WANDB_API_KEY:-'your_wandb_api_key'}" # make it your own wandb api key

Copilot AI Mar 29, 2026


The default values for HF_TOKEN/WANDB_API_KEY include single quotes inside the parameter expansion (e.g. :-'your_hf_token'), so the resulting env var value will literally contain quotes. That will break authentication unless users manually remove them; use an unquoted placeholder (or empty default) instead.
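The difference is easy to reproduce; `your_hf_token` is the placeholder from the script:

```shell
unset HF_TOKEN
BAD="${HF_TOKEN:-'your_hf_token'}"  # single quotes are literal inside ${...:-}
GOOD="${HF_TOKEN:-your_hf_token}"   # unquoted placeholder, usable as-is
```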

Comment on lines +3 to +5
export HF_TOKEN="your_hf_token" # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key
export DOCKER_IMAGE="docker.io/tasimage/primus:pr-563-ainic"

Copilot AI Mar 29, 2026


Hardcoding tokens in the script (export HF_TOKEN="your_hf_token", etc.) makes it easy to accidentally commit real credentials after editing. Prefer the ${VAR:-...} pattern (or require them to be provided externally) and avoid writing secrets into the repo.

Comment on lines 55 to 58
# Select and instantiate the tokenizer.
if args.tokenizer_type in CUSTOM_TOKENIZER_TYPES:
tokenizer = _HuggingFaceTokenizer(args.tokenizer_model)
tokenizer = _HuggingFaceTokenizer(args.tokenizer_model, trust_remote_code=True)
else:

Copilot AI Mar 29, 2026


Enabling trust_remote_code=True for all HuggingFace tokenizers will execute arbitrary code from remote model repos. This is a significant supply-chain risk; make this opt-in via a config/CLI flag (default False) or restrict it to the specific tokenizers that require it.

Comment on lines 179 to 183
# Prevent scratch memory from being reclaimed to stabilize large memory usage
# NOTE: Must disable scratch reclaim to avoid MoE training crash on AMD GPUs
# Setting this to 0 prevents core dumps when using Mixture-of-Experts (MoE) models
export HSA_NO_SCRATCH_RECLAIM=${HSA_NO_SCRATCH_RECLAIM:-0}
export HSA_NO_SCRATCH_RECLAIM=${HSA_NO_SCRATCH_RECLAIM:-1}


Copilot AI Mar 29, 2026


The comment says “Setting this to 0 prevents core dumps…”, but the default is now HSA_NO_SCRATCH_RECLAIM:-1. Either the comment is now incorrect or the default should remain 0; please align the comment with the actual behavior to avoid misconfiguration.

Comment on lines +3 to +4
export HF_TOKEN="${HF_TOKEN:-'your_hf_token'}" # make it your own hf token
export WANDB_API_KEY="${WANDB_API_KEY:-'your_wandb_api_key'}" # make it your own wandb api key

Copilot AI Mar 29, 2026


The default values for HF_TOKEN/WANDB_API_KEY include single quotes inside the parameter expansion, so the env var value will literally contain quotes. This will break authentication unless edited; use an unquoted placeholder (or empty default) instead.

Comment on lines 222 to 224
# Note: Disable v3 due to accuracy issues. Will fix after TE version 2.1.
export NVTE_CK_USES_BWD_V3=${NVTE_CK_USES_BWD_V3:-0}
export NVTE_CK_USES_BWD_V3=${NVTE_CK_USES_BWD_V3:-1}


Copilot AI Mar 29, 2026


The comment states “Disable v3 due to accuracy issues”, but the default was changed to NVTE_CK_USES_BWD_V3:-1, which enables it. Please either revert the default or update the comment to reflect the intended default/risks.

botaohu001 and others added 2 commits April 3, 2026 19:24
Add training configurations for the Minimax 2.5 and GLAM5 models,
including single-node and 8-node/16-node training scripts.

Co-authored-by: botaohu001 <Botao.Hu@amd.com>
Copilot AI review requested due to automatic review settings April 10, 2026 05:14

Copilot AI left a comment


Pull request overview

Copilot reviewed 175 out of 176 changed files in this pull request and generated 9 comments.

Comment on lines 86 to 92
def new_post_init(self):
tmp = getattr(self, "recompute_granularity", None)
self.recompute_granularity = None
orig_post_init(self)
self.recompute_granularity = tmp
validate_specified_recompute_layers(config_mod.TransformerConfig, args)


Copilot AI Apr 10, 2026


In new_post_init(), validate_specified_recompute_layers() is called with config_mod.TransformerConfig (the class) instead of the config instance (self). This will validate/mutate class attributes rather than the actual run’s config and can raise attribute errors or silently ignore the user-provided recompute_layer_ids. Pass the instance (self) into validate_specified_recompute_layers so it validates the effective config for this run.

Comment on lines +59 to +65
if self.config.packing and self.config.dataset_type != "synthetic":
if decoder_segment_ids is None:
decoder_segment_ids = jnp.ones(shape=query.shape[:2], dtype=jnp.int32)
attn_mask = SequenceDescriptor.from_segment_ids_and_pos(
segment_ids=decoder_segment_ids, segment_pos=None
)
qkv_layout = "THD_THD_THD" # 'T3HD', 'THD_T2HD' or 'THD_THD_THD'

Copilot AI Apr 10, 2026


PrimusAttentionOp.cudnn_flash_attention references SequenceDescriptor but it is never imported in this module, which will raise NameError when packing is enabled. Import SequenceDescriptor from the appropriate MaxText/TE module (matching the legacy MaxText API) or avoid referencing it here.

Comment on lines 220 to +224
# Validate container runtime (docker/podman)
# In dry-run mode, allow mock runtime if docker/podman not available
if [[ "$DRY_RUN_MODE" == "true" ]]; then
if command -v podman >/dev/null 2>&1; then
export CONTAINER_RUNTIME="podman"
elif command -v docker >/dev/null 2>&1; then
export CONTAINER_RUNTIME="docker"
else
# Mock runtime for dry-run testing
export CONTAINER_RUNTIME="docker"
LOG_INFO_RANK0 "[container] Using mock container runtime for dry-run (no docker/podman found)"
fi
if command -v docker >/dev/null 2>&1; then
export CONTAINER_RUNTIME="docker"
elif command -v podman >/dev/null 2>&1; then
export CONTAINER_RUNTIME="podman"

Copilot AI Apr 10, 2026


Container runtime detection now hard-fails if neither docker nor podman is in PATH. This also blocks --dry-run mode, which previously could be used for command-generation/testing on systems without a container runtime. Consider skipping runtime validation (or keeping the prior mock-runtime fallback) when DRY_RUN_MODE=true.

Comment thread primus/cli/main.py
Comment on lines 173 to +178
except SystemExit:
raise
except Exception:
# Torchrun/elastic can sometimes suppress worker tracebacks.
# Print here so users can see the real root cause.
if getattr(args, "debug", False):
traceback.print_exc()
else:
print(f"[Primus] Error: {traceback.format_exc().splitlines()[-1]}", file=sys.stderr)
print("[Primus] Re-run with --debug for full stack trace.", file=sys.stderr)
traceback.print_exc()

Copilot AI Apr 10, 2026


The CLI now always prints a full traceback (traceback.print_exc()) on any exception, regardless of the --debug flag. This makes --debug ineffective and can produce extremely noisy logs (especially under torchrun/elastic). Restore the prior behavior: print full traceback only when --debug is set, otherwise emit a concise error with a hint to re-run with --debug.

Comment on lines +3 to +4
export HF_TOKEN="${HF_TOKEN:-'your_hf_token'}" # make it your own hf token
export WANDB_API_KEY="${WANDB_API_KEY:-'your_wandb_api_key'}" # make it your own wandb api key

Copilot AI Apr 10, 2026


HF_TOKEN/WANDB_API_KEY defaults include single quotes inside the parameter expansion (e.g., ${HF_TOKEN:-'your_hf_token'}), which will literally set the env var value to a quoted string when unset. Remove the inner single quotes so the default is a usable placeholder value.

Comment on lines +59 to +60
export PRIMUS_EXP_NAME=minimax_m2.5-pretrain-platform_$PLATFORM-layers_$PRIMUS_TOTAL_LAYERS-type_$PRETRAIN_TYPE-mbs_$MBS-gbs_$GBS-legacygg_$LEGACY_GG
export PRIMUS_EXP_NAME=debug_minimax_m2.5_4layers-type_$PRETRAIN_TYPE-legacygg_$LEGACY_GG-turbogg_$TURBO_GROUPED_MLP

Copilot AI Apr 10, 2026


PRIMUS_EXP_NAME is assigned twice; the second assignment overwrites the first and forces a debug name. If this is meant to be optional, gate it behind a flag or comment it out; otherwise remove the redundant assignment so runs get the expected experiment naming/output directory.
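A sketch of gating the debug name behind an explicit flag instead of overwriting the real one; `DEBUG_RUN` is hypothetical and the name components are abbreviated from the script:

```shell
DEBUG_RUN="${DEBUG_RUN:-0}"
PRETRAIN_TYPE="bf16"   # assumed placeholder value

export PRIMUS_EXP_NAME="minimax_m2.5-pretrain-type_${PRETRAIN_TYPE}"
if [ "$DEBUG_RUN" = "1" ]; then
  export PRIMUS_EXP_NAME="debug_minimax_m2.5_4layers-type_${PRETRAIN_TYPE}"
fi
```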

Comment on lines +60 to +61
export PRIMUS_EXP_NAME=dsv3-pretrain-platform_$PLATFORM-layers_$PRIMUS_TOTAL_LAYERS-type_$PRETRAIN_TYPE-mbs_$MBS-gbs_$GBS-legacygg_$LEGACY_GG
export PRIMUS_EXP_NAME=debug_4layers-type_$PRETRAIN_TYPE-legacygg_$LEGACY_GG-turbogg_$TURBO_GROUPED_MLP

Copilot AI Apr 10, 2026


PRIMUS_EXP_NAME is assigned twice; the second assignment overwrites the first and forces a debug name. If this is meant to be optional, gate it behind a flag or comment it out; otherwise remove the redundant assignment so runs get the expected experiment naming/output directory.

Comment on lines +9 to +12
elif [ "$PLATFORM" = "B200" ] || [ "$PLATFORM" = "GB200" ]; then
export DOCKER_IMAGE="nvcr.io/nvidia/nemo:25.09"
EXTRA_ARGS="--use_rocm_mem_info_iters None"
else

Copilot AI Apr 10, 2026


EXTRA_ARGS is assigned as a single string containing a space ("--use_rocm_mem_info_iters None") and later passed as "${EXTRA_ARGS}". When quoted, this becomes one CLI argument (with an embedded space) instead of two separate arguments, which will likely break argparse parsing. Use an array for EXTRA_ARGS (and expand it as "${EXTRA_ARGS[@]}"), or avoid quoting and ensure correct word-splitting.
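The word-splitting difference can be shown with a small probe; `count_args` is a helper defined here for illustration, not part of the script:

```shell
count_args() { echo "$#"; }

BROKEN="--use_rocm_mem_info_iters None"       # one string with an embedded space
N_STRING=$(count_args "${BROKEN}")            # quoted string -> single argument

EXTRA_ARGS=(--use_rocm_mem_info_iters None)   # array keeps the words separate
N_ARRAY=$(count_args "${EXTRA_ARGS[@]}")      # -> two arguments
```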


8 participants