Conversation
4layers proxy HSA_NO_SCRATCH_RECLAIM=1 NVTE_CK_USES_BWD_V3=1 GPU_MAX_HW_QUEUES=4
Pull request overview
Adds/updates Primus Megatron training launch scripts and model configs for MoE workloads (Qwen3 A3B/A22B, DeepSeek V2/V3), plus supporting tooling and infra changes (data prep, Slurm/local launch behavior, logging, CI).
Changes:
- Add multiple “start_training_*.sh” entrypoint scripts for Slurm-based pretraining runs.
- Update Megatron YAML configs to enable MoE dispatcher/grouped GEMM and cross-entropy TE fusion in several Qwen3/DeepSeek configs.
- Adjust runtime tooling: Slurm launcher env handling, local launcher container behavior, training log memory reporting, C4 prep script, and CI workflow.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 12 comments.
Summary per file:
| File | Description |
|---|---|
| start_training_qwen_30B_a3B.sh | New Slurm training wrapper for Qwen3-30B-A3B. |
| start_training_qwen_235B_a22B.sh | New Slurm training wrapper for Qwen3-235B-A22B, including PP/VPP layout logic. |
| start_training_dsv3_4layers_proxy.sh | New small-scale DeepSeek-V3 script with platform-based docker selection. |
| start_training_dsv3.sh | New/updated DeepSeek-V3 training wrapper with PP/VPP layout handling. |
| start_training_dsv2_lite.sh | New Slurm training wrapper for DeepSeek-V2-Lite. |
| primus/configs/models/megatron/qwen2.5_base.yaml | Adds MoE dispatcher/grouped GEMM defaults to Qwen base config. |
| primus/backends/megatron/patches/training_log/print_rank_last_patches.py | Enhances log output with max ROCm memory usage across ranks. |
| prepare_c4_data.sh | Adds a script to merge/tokenize C4-en shards for Megatron pretraining. |
| examples/run_slurm_pretrain.sh | Switches srun options to use SLURM_* env vars for time/partition/nodelist. |
| examples/run_pretrain.sh | Adds/changes AINIC networking/NCCL tuning exports. |
| examples/run_local_pretrain.sh | Adds a fixed data mount, image pull-if-missing behavior, and container ulimit. |
| examples/megatron/prepare.py | Adds --no-build-isolation to Emerging-Optimizers editable install. |
| examples/megatron/configs/MI355X/qwen3_30B_A3B-FP8-pretrain.yaml | Enables MoE dispatcher + enables TE cross-entropy fusion flags. |
| examples/megatron/configs/MI355X/qwen3_30B_A3B-BF16-pretrain.yaml | Enables MoE dispatcher/grouped GEMM + enables TE cross-entropy fusion flags. |
| examples/megatron/configs/MI355X/qwen3_235B_A22B-FP8-pretrain.yaml | Enables TE cross-entropy fusion flags. |
| examples/megatron/configs/MI355X/qwen3_235B_A22B-BF16-pretrain.yaml | Enables TE cross-entropy fusion flags. |
| examples/megatron/configs/MI355X/deepseek_v3-FP8-pretrain.yaml | Adds Turbo/deepep-related settings + enables TE cross-entropy fusion flags. |
| examples/megatron/configs/MI355X/deepseek_v3-BF16-pretrain.yaml | Changes default PP size + adds Turbo/deepep settings + enables TE cross-entropy fusion flags. |
| examples/megatron/configs/MI355X/deepseek_v2_lite-FP8-pretrain.yaml | Enables TE cross-entropy fusion flags. |
| examples/megatron/configs/MI355X/deepseek_v2_lite-BF16-pretrain.yaml | Enables TE cross-entropy fusion flags. |
| .github/workflows/ci.yaml | Comments out the JAX unit test job. |
| if [ "$PLATFORM" = "MI355X" ]; then | ||
| export DOCKER_IMAGE="docker.io/tasimage/primus:pr-563-ainic" | ||
| elif [ "$PLATFORM" = "B200" ] || [ "$PLATFORM" = "GB200" ]; then | ||
| export DOCKER_IMAGE="nvcr.io/nvidia/nemo:25.09" | ||
| EXTRA_ARGS="--use_rocm_mem_info_iters None" |
There was a problem hiding this comment.
EXTRA_ARGS is built as a single string containing spaces and later passed as "$EXTRA_ARGS", which becomes a single CLI argument (and even an empty argument when unset). This will break flag parsing for run_slurm_pretrain.sh. Use a bash array for optional args (e.g., EXTRA_ARGS=() and append --use_rocm_mem_info_iters and None as separate elements) and expand it with "${EXTRA_ARGS[@]}" (or omit entirely when empty).
| if [ "$PLATFORM" = "MI355X" ]; then | |
| export DOCKER_IMAGE="docker.io/tasimage/primus:pr-563-ainic" | |
| elif [ "$PLATFORM" = "B200" ] || [ "$PLATFORM" = "GB200" ]; then | |
| export DOCKER_IMAGE="nvcr.io/nvidia/nemo:25.09" | |
| EXTRA_ARGS="--use_rocm_mem_info_iters None" | |
| EXTRA_ARGS=() | |
| if [ "$PLATFORM" = "MI355X" ]; then | |
| export DOCKER_IMAGE="docker.io/tasimage/primus:pr-563-ainic" | |
| elif [ "$PLATFORM" = "B200" ] || [ "$PLATFORM" = "GB200" ]; then | |
| export DOCKER_IMAGE="nvcr.io/nvidia/nemo:25.09" | |
| EXTRA_ARGS=(--use_rocm_mem_info_iters None) |
```bash
export NCCL_IB_GID_INDEX=1
# export NCCL_IB_ROCE_VERSION_NUM=2
export NCCL_MAX_P2P_CHANNELS=56
# export NCCL_IB_TC=104
# export NCCL_IB_FIFO_TC=192
export NCCL_IB_TC=41
export NCCL_IB_FIFO_TC=185
export NET_OPTIONAL_RECV_COMPLETION=1
export NCCL_IB_USE_INLINE=1
export RCCL_GDR_FLUSH_GPU_MEM_NO_RELAXED_ORDERING=0
export NCCL_GDR_FLUSH_DISABLE=1
export NCCL_DMABUF_ENABLE=0
export NCCL_IGNORE_CPU_AFFINITY=1
export NCCL_IB_QPS_PER_CONNECTION=1

export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/libibverbs:${RCCL_HOME_DIR}/build/release:${ANP_HOME_DIR}/build:${MPI_HOME_DIR}/lib:$LD_LIBRARY_PATH
else
    export NCCL_IB_GID_INDEX=3
```
This block re-exports several NCCL/RCCL tuning env vars (including NCCL_IB_GID_INDEX, NCCL_MAX_P2P_CHANNELS, NCCL_IB_TC, NCCL_IB_FIFO_TC) after they were already computed/logged earlier in the same USING_AINIC branch. In particular, forcing NCCL_IB_TC=41/NCCL_IB_FIFO_TC=185 overrides the earlier detection/default logic and makes the earlier logs misleading. Consider removing this duplicate section, or only setting values when not already set so the detection path remains effective.
Suggested change:
```diff
-export NCCL_IB_GID_INDEX=1
-# export NCCL_IB_ROCE_VERSION_NUM=2
-export NCCL_MAX_P2P_CHANNELS=56
-# export NCCL_IB_TC=104
-# export NCCL_IB_FIFO_TC=192
-export NCCL_IB_TC=41
-export NCCL_IB_FIFO_TC=185
-export NET_OPTIONAL_RECV_COMPLETION=1
-export NCCL_IB_USE_INLINE=1
-export RCCL_GDR_FLUSH_GPU_MEM_NO_RELAXED_ORDERING=0
-export NCCL_GDR_FLUSH_DISABLE=1
-export NCCL_DMABUF_ENABLE=0
-export NCCL_IGNORE_CPU_AFFINITY=1
-export NCCL_IB_QPS_PER_CONNECTION=1
-export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/libibverbs:${RCCL_HOME_DIR}/build/release:${ANP_HOME_DIR}/build:${MPI_HOME_DIR}/lib:$LD_LIBRARY_PATH
-else
-export NCCL_IB_GID_INDEX=3
+[ -z "${NCCL_IB_GID_INDEX}" ] && export NCCL_IB_GID_INDEX=1
+# export NCCL_IB_ROCE_VERSION_NUM=2
+[ -z "${NCCL_MAX_P2P_CHANNELS}" ] && export NCCL_MAX_P2P_CHANNELS=56
+# export NCCL_IB_TC=104
+# export NCCL_IB_FIFO_TC=192
+[ -z "${NCCL_IB_TC}" ] && export NCCL_IB_TC=41
+[ -z "${NCCL_IB_FIFO_TC}" ] && export NCCL_IB_FIFO_TC=185
+[ -z "${NET_OPTIONAL_RECV_COMPLETION}" ] && export NET_OPTIONAL_RECV_COMPLETION=1
+[ -z "${NCCL_IB_USE_INLINE}" ] && export NCCL_IB_USE_INLINE=1
+[ -z "${RCCL_GDR_FLUSH_GPU_MEM_NO_RELAXED_ORDERING}" ] && export RCCL_GDR_FLUSH_GPU_MEM_NO_RELAXED_ORDERING=0
+[ -z "${NCCL_GDR_FLUSH_DISABLE}" ] && export NCCL_GDR_FLUSH_DISABLE=1
+[ -z "${NCCL_DMABUF_ENABLE}" ] && export NCCL_DMABUF_ENABLE=0
+[ -z "${NCCL_IGNORE_CPU_AFFINITY}" ] && export NCCL_IGNORE_CPU_AFFINITY=1
+[ -z "${NCCL_IB_QPS_PER_CONNECTION}" ] && export NCCL_IB_QPS_PER_CONNECTION=1
+export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/libibverbs:${RCCL_HOME_DIR}/build/release:${ANP_HOME_DIR}/build:${MPI_HOME_DIR}/lib:$LD_LIBRARY_PATH
+else
+[ -z "${NCCL_IB_GID_INDEX}" ] && export NCCL_IB_GID_INDEX=3
```
```bash
JSONL_FILE="${JSONL_DIR}/c4_en_train.jsonl"

if [ -f "${JSONL_FILE}" ]; then
    echo "JSONL file already exists: ${JSONL_FILE}"
    echo "Skipping merge. Delete it to re-merge."
else
    # Verify shards exist
    MISSING=0
    for i in $(seq 0 $((NUM_SHARDS - 1))); do
        SHARD_NAME=$(printf "c4-train.%05d-of-01024.json.gz" "$i")
        if [ ! -f "${RAW_DIR}/${SHARD_NAME}" ]; then
            echo "  WARNING: Missing shard ${SHARD_NAME}"
            MISSING=$((MISSING + 1))
        fi
    done
    if [ "$MISSING" -gt 0 ]; then
        echo "ERROR: ${MISSING} shard(s) missing in ${RAW_DIR}. Cannot proceed."
        exit 1
    fi

    echo "Decompressing and merging shards into JSONL ..."
    for i in $(seq 0 $((NUM_SHARDS - 1))); do
        SHARD_NAME=$(printf "c4-train.%05d-of-01024.json.gz" "$i")
        SHARD_PATH="${RAW_DIR}/${SHARD_NAME}"
        echo "  [${i}/${NUM_SHARDS}] Decompressing ${SHARD_NAME} ..."
        zcat "${SHARD_PATH}" >> "${JSONL_FILE}"
    done
```
The JSONL merge step appends shards directly into the final c4_en_train.jsonl. If the script is interrupted mid-merge, the partially written JSONL file will exist and subsequent runs will skip merging, leaving a corrupt/incomplete dataset. Write to a temporary file and mv atomically on success (or truncate the target before the loop and add a checksum/size check) to make reruns safe.
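A minimal sketch of the temp-file approach the comment suggests, reusing the variables from the hunk above (the `TMP_JSONL` name is illustrative, not from the PR):

```bash
# Merge into a temp file; publish atomically only after every shard succeeds.
TMP_JSONL="${JSONL_FILE}.tmp"   # hypothetical temp path on the same filesystem
: > "${TMP_JSONL}"              # truncate any leftover partial file from a prior run
for i in $(seq 0 $((NUM_SHARDS - 1))); do
    SHARD_NAME=$(printf "c4-train.%05d-of-01024.json.gz" "$i")
    zcat "${RAW_DIR}/${SHARD_NAME}" >> "${TMP_JSONL}"
done
mv "${TMP_JSONL}" "${JSONL_FILE}"   # rename is atomic within one filesystem
```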
```yaml
pipeline_model_parallel_size: ${PRIMUS_PP:8}
expert_model_parallel_size: ${PRIMUS_EP:8}
```
Changing the default pipeline_model_parallel_size from ${PRIMUS_PP:1} to ${PRIMUS_PP:8} makes the config default to PP=8 when PRIMUS_PP isn’t set, which can break or behave unexpectedly on single-node or small runs. Consider keeping the default at 1 (and letting training scripts set PRIMUS_PP explicitly), or document clearly that this config now requires PRIMUS_PP to be set appropriately.
Suggested change:
```diff
-pipeline_model_parallel_size: ${PRIMUS_PP:8}
-expert_model_parallel_size: ${PRIMUS_EP:8}
+pipeline_model_parallel_size: ${PRIMUS_PP:1}
+expert_model_parallel_size: ${PRIMUS_EP:1}
```
```bash
${SLURM_TIME:+--time="${SLURM_TIME}"} \
${SLURM_NODELIST:+--nodelist="${SLURM_NODELIST}"} \
${SLURM_PARTITION:+--partition="${SLURM_PARTITION}"} \
```
Switching to SLURM_NODELIST/SLURM_TIME/SLURM_PARTITION as user-provided overrides is confusing because SLURM_NODELIST is also a Slurm-populated env var inside allocations. That makes it hard to distinguish “allocated nodes” vs “requested override nodes”, and can accidentally force --nodelist when running inside Slurm. Consider keeping the previous NODE_LIST override (or renaming to something like PRIMUS_NODELIST_OVERRIDE) and updating the script to prefer that, falling back to Slurm’s own env vars when present.
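A sketch of that separation; `PRIMUS_NODELIST_OVERRIDE` is an illustrative name, not a variable from this PR:

```bash
# Prefer an explicit user override; otherwise rely on the Slurm allocation itself.
NODELIST_ARGS=()
if [ -n "${PRIMUS_NODELIST_OVERRIDE:-}" ]; then      # hypothetical override variable
    NODELIST_ARGS=(--nodelist="${PRIMUS_NODELIST_OVERRIDE}")
elif [ -z "${SLURM_JOB_ID:-}" ] && [ -n "${SLURM_NODELIST:-}" ]; then
    # Outside an allocation, SLURM_NODELIST can only be a user-provided request.
    NODELIST_ARGS=(--nodelist="${SLURM_NODELIST}")
fi
srun -N "${NNODES}" "${NODELIST_ARGS[@]}" --ntasks-per-node=1 hostname
```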
| f" | rank-{max_rank} rocm max mem usage/usage_ratio: " | ||
| f"{max_r_used / 1024 ** 3:.2f}GB/" | ||
| f"{max_r_used / r_total * 100:.2f}%" |
There was a problem hiding this comment.
The reported rank-{max_rank} rocm max mem ... usage_ratio uses max_r_used / r_total, where r_total is from the local rank, not necessarily max_rank. If total memory differs across GPUs/nodes, this ratio can be incorrect. Consider gathering r_total as well (or computing the ratio on the max-rank and broadcasting it) so the max-rank ratio is self-consistent.
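One way to make the ratio self-consistent is to gather used and total together per rank (a sketch; variable names follow the hunk above):

```python
import torch

# Gather (used, total) as one tensor so both values come from the same rank.
pair = torch.tensor([r_used, r_total], device="cuda", dtype=torch.int64)
world_size = torch.distributed.get_world_size()
gathered = [torch.zeros_like(pair) for _ in range(world_size)]
torch.distributed.all_gather(gathered, pair)

used_totals = [(int(t[0]), int(t[1])) for t in gathered]
max_rank, (max_used, max_total) = max(enumerate(used_totals), key=lambda x: x[1][0])
ratio = max_used / max_total * 100  # numerator and denominator from the same rank
```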
| ARGS=("$@") | ||
|
|
||
| VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH") | ||
| VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH" -v "/shared_aig/c4:/shared_aig/c4") |
There was a problem hiding this comment.
Adding a hard-coded bind mount for /shared_aig/c4 makes run_local_pretrain.sh environment-specific and can fail (or create unexpected host directories) on machines that don’t have that path. Consider making this conditional (only mount if the directory exists) or driven by an env var (e.g., EXTRA_VOLUME_ARGS / PRIMUS_DATA_MOUNTS) so the script remains portable.
Suggested change:
```diff
-VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH" -v "/shared_aig/c4:/shared_aig/c4")
+VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH")
+if [ -d "/shared_aig/c4" ]; then
+    VOLUME_ARGS+=(-v "/shared_aig/c4:/shared_aig/c4")
+fi
+if [[ -n "${EXTRA_VOLUME_ARGS:-}" ]]; then
+    # Allow callers to inject additional volume args, e.g. EXTRA_VOLUME_ARGS='-v /host/path:/container/path'
+    VOLUME_ARGS+=(${EXTRA_VOLUME_ARGS})
+fi
```
```yaml
# run-unittest-jax:
#   env:
#     PRIMUS_WORKDIR: /wekafs/primus-data/primus_safe_ci/jax
#   needs: [code-lint]
#   runs-on: [primus-lm-cicd-jax-8t8mh]
#   steps:
#     - run: echo "🎉 Begin Primus-Turbo Checkout."
#     - name: Set commit hash to env
#       run: echo "PRIMUS_TURBO_COMMIT=${PRIMUS_TURBO_COMMIT}" >> $GITHUB_ENV
#     - name: Checkout Repo Primus-Turbo
#       uses: actions/checkout@v4
#       with:
#         repository: AMD-AIG-AIMA/Primus-Turbo
#         submodules: "recursive"
#         path: Primus-Turbo
#         ref: ${{ env.PRIMUS_TURBO_COMMIT }}
#     - run: echo "Begin Primus-Turbo Install."
#     - name: Install Primus-Turbo
#       run: |
#         mv Primus-Turbo /tmp/
#         echo "Primus-Turbo dir: /tmp/Primus-Turbo"
#         git config --global --add safe.directory /tmp/Primus-Turbo
#         cd /tmp/Primus-Turbo
#         start_time=$(date +%s)
#         echo "✅ [Pip install requirements] started at: $(date)"
```
This PR comments out the entire run-unittest-jax job, and there doesn’t appear to be any other workflow job running the --jax unit tests after this change. If JAX support is still maintained, this will remove CI coverage and allow regressions to land. Consider fixing the runner selection / installation issue and keeping the job enabled, or replacing it with an equivalent JAX test job.
Suggested change:
```diff
-# run-unittest-jax:
-#   env:
-#     PRIMUS_WORKDIR: /wekafs/primus-data/primus_safe_ci/jax
-#   needs: [code-lint]
-#   runs-on: [primus-lm-cicd-jax-8t8mh]
-#   steps:
-#     - run: echo "🎉 Begin Primus-Turbo Checkout."
-#     - name: Set commit hash to env
-#       run: echo "PRIMUS_TURBO_COMMIT=${PRIMUS_TURBO_COMMIT}" >> $GITHUB_ENV
-#     - name: Checkout Repo Primus-Turbo
-#       uses: actions/checkout@v4
-#       with:
-#         repository: AMD-AIG-AIMA/Primus-Turbo
-#         submodules: "recursive"
-#         path: Primus-Turbo
-#         ref: ${{ env.PRIMUS_TURBO_COMMIT }}
-#     - run: echo "Begin Primus-Turbo Install."
-#     - name: Install Primus-Turbo
-#       run: |
-#         mv Primus-Turbo /tmp/
-#         echo "Primus-Turbo dir: /tmp/Primus-Turbo"
-#         git config --global --add safe.directory /tmp/Primus-Turbo
-#         cd /tmp/Primus-Turbo
-#         start_time=$(date +%s)
-#         echo "✅ [Pip install requirements] started at: $(date)"
+run-unittest-jax:
+  env:
+    PRIMUS_WORKDIR: /wekafs/primus-data/primus_safe_ci/jax
+  needs: [code-lint]
+  runs-on: [primus-lm-cicd-jax-8t8mh]
+  steps:
+    - name: Begin JAX unit tests
+      run: echo "🎉 Begin JAX unit tests in ${PRIMUS_WORKDIR}"
+    - name: Set commit hash to env
+      run: echo "PRIMUS_TURBO_COMMIT=${PRIMUS_TURBO_COMMIT}" >> $GITHUB_ENV
+    - name: Checkout repository
+      uses: actions/checkout@v4
+    - name: Set up Python
+      uses: actions/setup-python@v5
+      with:
+        python-version: "3.10"
+    - name: Install dependencies (including JAX extras)
+      run: |
+        python -m pip install --upgrade pip
+        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
+        # Attempt to install JAX-related extras if defined; ignore failures
+        pip install -e ".[jax,test]" || pip install -e ".[jax]" || true
+    - name: Run JAX unit tests
+      run: |
+        pytest --jax
```
```bash
--profile_step_end 7 \
--profile_step_start 6 \
```
TRAIN_ITERS is set to 5, but --profile_step_start 6 / --profile_step_end 7 are outside the iteration range. If profiling is enabled (PROFILE=True), this likely results in no profiling data being captured. Consider tying the profile step range to TRAIN_ITERS (or lowering these defaults) so the script works when toggling profiling on.
Suggested change:
```diff
-    --profile_step_end 7 \
-    --profile_step_start 6 \
+    --profile_step_end $TRAIN_ITERS \
+    --profile_step_start 1 \
```
```bash
--profile $PROFILE \
--use_pytorch_profiler $PROFILE \
--profile_step_end 7 \
--profile_step_start 6 \
```
TRAIN_ITERS is set to 5, but --profile_step_start 6 / --profile_step_end 7 are outside the iteration range. If profiling is enabled (PROFILE=True), this likely results in no profiling data being captured. Consider tying the profile step range to TRAIN_ITERS (or lowering these defaults) so the script works when toggling profiling on.
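A sketch of deriving the window from TRAIN_ITERS instead of hardcoding 6/7 (the clamping below is illustrative):

```bash
# Profile the final two iterations regardless of how short the run is.
PROFILE_STEP_END=$TRAIN_ITERS
PROFILE_STEP_START=$(( TRAIN_ITERS > 1 ? TRAIN_ITERS - 1 : 1 ))
# ...then pass --profile_step_start "$PROFILE_STEP_START" --profile_step_end "$PROFILE_STEP_END"
```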
```bash
# ======================== Configuration ========================
NUM_SHARDS=${NUM_SHARDS:-200}  # Number of C4 shards to download (1-1024)
DATA_DIR=${DATA_DIR:-"/shared/c4"}
PRIMUS_PATH=${PRIMUS_PATH:-"/shared/john/Primus"}
```
| export HF_TOKEN="your_hf_token" # make it your own hf token | ||
| export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key |
| log_info(f"Building Emerging Optimizers in {emerging_optimizers_path}") | ||
| ret = subprocess.run(["pip", "install", "-e", str(emerging_optimizers_path)], check=True) | ||
| ret = subprocess.run( | ||
| ["pip", "install", "--no-build-isolation", "-e", str(emerging_optimizers_path)], check=True |
```yaml
moe_token_dispatcher_type: alltoall

moe_grouped_gemm: true
```
Pull request overview
Copilot reviewed 24 out of 24 changed files in this pull request and generated 13 comments.
Comments suppressed due to low confidence (1)
primus/configs/models/megatron/qwen2.5_base.yaml:34
- This YAML now defines `moe_token_dispatcher_type` and `moe_grouped_gemm` twice (once around lines 26–28 and again around lines 32–35). Duplicate keys in YAML are ambiguous (parsers typically take the last value) and can hide configuration mistakes. Please remove the redundant earlier entries or consolidate them into a single block.
```yaml
moe_token_dispatcher_type: alltoall
moe_grouped_gemm: true
apply_rope_fusion: true
masked_softmax_fusion: false
moe_token_dispatcher_type: alltoall
moe_grouped_gemm: true
moe_use_legacy_grouped_gemm: false
```
```bash
--disable_wandb True \
--disable_tensorboard True \
"$EXTRA_ARGS" \
2>&1 | tee output/$PRIMUS_TEAM/$PRIMUS_USER/$PRIMUS_EXP_NAME/log.txt
```
| export HF_TOKEN="your_hf_token" # make it your own hf token | ||
| export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key |
| srun -N "${NNODES}" \ | ||
| --exclusive \ | ||
| --export ALL \ | ||
| ${NODE_LIST:+--nodelist="${NODE_LIST}"} \ | ||
| --ntasks-per-node=1 \ | ||
| ${SLURM_TIME:+--time="${SLURM_TIME}"} \ | ||
| ${SLURM_NODELIST:+--nodelist="${SLURM_NODELIST}"} \ | ||
| ${SLURM_PARTITION:+--partition="${SLURM_PARTITION}"} \ |
| export HF_TOKEN="your_hf_token" # make it your own hf token | ||
| export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key |
| export HF_TOKEN="your_hf_token" # make it your own hf token | ||
| export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key | ||
|
|
| # run-unittest-jax: | ||
| # env: | ||
| # PRIMUS_WORKDIR: /wekafs/primus-data/primus_safe_ci/jax | ||
| # needs: [code-lint] | ||
| # runs-on: [primus-lm-cicd-jax-8t8mh] | ||
| # steps: |
| export HF_TOKEN="your_hf_token" # make it your own hf token | ||
| export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key |
| export SLURM_NODELIST="uswslocpm2m-106-[030-031,038-039,050,063,069,225,942,1531-1532,1536,1547,1549,1554,1556-1557,1561,1579,1583,1585,1588,1592,1596,1606,1627-1629,1650,1659-1660,1678]" | ||
| export SLURM_NODELIST="uswslocpm2m-106-[030-031,038-039,050,063,069,225,942,1531-1532,1536,1547,1549,1554,1556]" # -1557,1561,1579,1583,1585,1588,1592,1596,1606,1627-1629,1650,1659-1660,1678]" | ||
| export SLURM_NODELIST="uswslocpm2m-106-[030-031,038-039,050,063,069,225,942,1531-1532,1536,-1557,1561,1579,1583]" #1585,1588,1592,1596,1606,1627-1629,1650,1659-1660,1678]" |
| export NCCL_IB_TC=41 | ||
| export NCCL_IB_FIFO_TC=185 |
| export HF_TOKEN="your_hf_token" # make it your own hf token | ||
| export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key | ||
|
|
Remove the --debug gate so users always see the complete stack trace alongside the one-line summary, reducing back-and-forth when diagnosing failures. Made-with: Cursor
Pull request overview
Copilot reviewed 159 out of 159 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (13)
- `runner/primus-cli-container.sh:1`: This now hard-fails when neither docker nor podman is present, even in dry-run mode. The previous logic explicitly allowed a mock runtime during dry-runs; without that, `--dry-run` becomes unusable on hosts lacking container tooling. Consider restoring the dry-run fallback (mock runtime) or skipping runtime validation when `DRY_RUN_MODE == true`.
- `primus/modules/trainer/megatron/trainer.py:1`: The new defaulting logic does not handle the common case where `args.moe_layer_freq` exists but is `None`. In that case, the value remains `None` and downstream code expecting an int/list can break. Suggest setting the default when `getattr(args, "moe_layer_freq", None) is None` (not only when the attribute is missing).
- `runner/helpers/envs/base_env.sh:1`: The comments and defaults are now inconsistent: `HSA_NO_SCRATCH_RECLAIM` defaults to 1, but the comment claims setting it to 0 prevents core dumps; similarly the TE comment says to disable v3 due to accuracy issues, but the default is now 1 (enabled). Please update the comments to match the new defaults/behavior, or revert the defaults to align with the documented guidance.
- `runner/helpers/envs/base_env.sh:1`: The comments and defaults are now inconsistent: `HSA_NO_SCRATCH_RECLAIM` defaults to 1, but the comment claims setting it to 0 prevents core dumps; similarly the TE comment says to disable v3 due to accuracy issues, but the default is now 1 (enabled). Please update the comments to match the new defaults/behavior, or revert the defaults to align with the documented guidance.
- `start_training_dsv3.sh:1`: This script exports `SLURM_NODELIST` four times (only the last value takes effect), and one intermediate value contains `-1557` inside the bracket expression, which looks like an invalid node/range token. Please remove the redundant exports and keep a single valid `SLURM_NODELIST` (or make the selection conditional via an argument/env var).
- `start_training_dsv3_4layers_proxy.sh:1`: The first `PRIMUS_EXP_NAME` assignment is immediately overwritten, so it will never be used. This makes it hard to tell which experiment naming convention is intended. Consider making the debug name conditional (e.g., gated on an env var) or removing the unused assignment.
- `primus/backends/megatron/patches/training_log/print_rank_last_patches.py:1`: Using `all_gather` allocates O(world_size) tensors and moves a list of values to every rank on each sample, which can add noticeable overhead at large scale. If you only need the maximum, consider an `all_reduce(MAX)` on `r_used_tensor`. If you also want the rank of the max, consider reducing a (value, rank) pair via a custom reduction or gathering only on rank 0.
- `primus/backends/megatron/patches/pp_warmup_patches.py:1`: This warmup implementation assumes a very specific model structure (`model` iterable of chunks, then `model_chunk.module.module.decoder.layers`) and a specific per-layer forward signature/return shape (`dummy_output, _ = layer.forward(...)`). This is likely to break for non-GPT models (e.g., Mamba), encoder-decoder variants, or Megatron version differences, causing pre-train crashes when `pp_warmup` is enabled. Recommend adding robust feature-detection (validate attribute chain + callable signature) and falling back to a safer warmup (e.g., a single dummy forward/backward through the normal `forward_step_func`), or gating the patch to known-compatible backend/model_types.
- `primus/backends/megatron/patches/pp_warmup_patches.py:1`: This warmup implementation assumes a very specific model structure (`model` iterable of chunks, then `model_chunk.module.module.decoder.layers`) and a specific per-layer forward signature/return shape (`dummy_output, _ = layer.forward(...)`). This is likely to break for non-GPT models (e.g., Mamba), encoder-decoder variants, or Megatron version differences, causing pre-train crashes when `pp_warmup` is enabled. Recommend adding robust feature-detection (validate attribute chain + callable signature) and falling back to a safer warmup (e.g., a single dummy forward/backward through the normal `forward_step_func`), or gating the patch to known-compatible backend/model_types.
- `runner/helpers/hooks/03_enable_ainic.sh:1`: These settings are now hard-coded and will override any values a user may have set in the environment. Previously the script used `${VAR:-default}` patterns, which allowed overrides for cluster-specific tuning. Consider restoring the override-friendly defaults (e.g., `NCCL_IB_TC=${NCCL_IB_TC:-41}` etc.) so operators can tune without editing the hook.
- `primus/backends/maxtext/patches/max_utils_patches.py:1`: This monkey-patches `jax.distributed.initialize` globally and does not restore it. In long-lived processes (or if multiple backends/modules are initialized in one process), this can leak behavior across unrelated code and make debugging version-specific JAX issues harder. Consider restoring the original function after MaxText initialization (e.g., patch only within the initialization call scope) or attaching the wrapper in a way that can be cleanly undone during teardown.
- `primus/cli/main.py:1`: The exception handler now always prints the full traceback via `traceback.print_exc()`, then also prints a summarized error line. This changes the CLI behavior from the previous `--debug`-gated verbosity and can significantly increase log volume (and potentially expose internal paths) in production runs. Consider restoring the `--debug` gating (or an env flag) for full tracebacks, while keeping the concise error+location output for non-debug runs.
- `primus/cli/main.py:1`: The exception handler now always prints the full traceback via `traceback.print_exc()`, then also prints a summarized error line. This changes the CLI behavior from the previous `--debug`-gated verbosity and can significantly increase log volume (and potentially expose internal paths) in production runs. Consider restoring the `--debug` gating (or an env flag) for full tracebacks, while keeping the concise error+location output for non-debug runs.
add opt config
Co-authored-by: HuangWei-95 <weihuan@amd.com>
Added configuration modifications to qwen3_30B, dated March 24th.
```python
import megatron.core.pipeline_parallel as pp_module
import megatron.training.training as training_module

import primus.modules.trainer.megatron.utils as utils
```
```bash
export SLURM_NODELIST="uswslocpm2m-106-[030-031,038-039,050,063,069,225,942,1531-1532,1536,1547,1549,1554,1556-1557,1561,1579,1583,1585,1588,1592,1596,1606,1627-1629,1650,1659-1660,1678]"
export SLURM_NODELIST="uswslocpm2m-106-[030-031,038-039,050,063,069,225,942,1531-1532,1536,1547,1549,1554,1556]" # -1557,1561,1579,1583,1585,1588,1592,1596,1606,1627-1629,1650,1659-1660,1678]"
export SLURM_NODELIST="uswslocpm2m-106-[030-031,038-039,050,063,069,225,942,1531-1532,1536,-1557,1561,1579,1583]" #1585,1588,1592,1596,1606,1627-1629,1650,1659-1660,1678]"
export SLURM_NODELIST="uswslocpm2m-106-[030-031,038-039,050,063,069,225,942,1531-1532,1536,1547,1549,1554,1556-1557,1561,1579,1583,1585,1588,1592,1596,1606,1627-1629,1683-1684,1691,1697]"
```
This script exports SLURM_NODELIST four times in a row; the first three assignments are dead code and one contains a malformed token (...,1536,-1557,1561,...). Keeping only a single SLURM_NODELIST (or making it an input override) will avoid confusion and accidental invalid nodelists.
```bash
8)
    FEATURE_ARGS+=("--pipeline_model_parallel_layout" "'Et*7|t*8|t*8|t*8|t*8|t*8|t*7|t*7,L'")
    ;;
16)
    FEATURE_ARGS+=("--pipeline_model_parallel_layout" "'Et*3|t*4|t*4|t*4|t*4|t*4|t*4|t*4|t*4|t*4|t*4|t*4|t*4|t*4|t*3|t*3,L'")
    ;;
32)
    FEATURE_ARGS+=("--pipeline_model_parallel_layout" "'Et*1|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*1|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*2|t*1,L'")
    ;;
```
FEATURE_ARGS values include embedded single quotes (e.g., "'Et*7|...|...,L'"). Because the value is already quoted by Bash, those inner quotes will be passed literally to the CLI and may break pipeline_model_parallel_layout parsing. Pass the layout string without the extra '...' wrapper.
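For reference, the 8-way case without the inner quotes (a sketch of the suggested fix):

```bash
# Bash's double quotes are sufficient; the CLI receives the bare layout string.
FEATURE_ARGS+=("--pipeline_model_parallel_layout" "Et*7|t*8|t*8|t*8|t*8|t*8|t*7|t*7,L")
```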
| srun -N "${NNODES}" \ | ||
| --exclusive \ | ||
| --export ALL \ | ||
| ${NODE_LIST:+--nodelist="${NODE_LIST}"} \ | ||
| --ntasks-per-node=1 \ | ||
| ${SLURM_TIME:+--time="${SLURM_TIME}"} \ | ||
| ${SLURM_NODELIST:+--nodelist="${SLURM_NODELIST}"} \ | ||
| ${SLURM_PARTITION:+--partition="${SLURM_PARTITION}"} \ |
There was a problem hiding this comment.
This srun invocation now uses SLURM_TIME, SLURM_NODELIST, and SLURM_PARTITION, but the script’s --help section (not shown in this hunk) still documents NODE_LIST. Please update the help text to match the new env var names/options to avoid user confusion.
```python
args = get_megatron_args()
if getattr(args, "pp_warmup", True):
    run_pp_warmup(model, config, args, optimizer, get_timers())
return original_train(
```
getattr(args, "pp_warmup", True) defaults to True, so PP warmup will run even when the flag is absent. This is risky because warmup does extra forward/backward work and may break non-GPT models. Default this to False (or require the arg to be explicitly set), and add a one-time guard (e.g., closure flag) so warmup truly runs only once as stated in the docstring.
```diff
 # Validate container runtime (docker/podman)
-# In dry-run mode, allow mock runtime if docker/podman not available
-if [[ "$DRY_RUN_MODE" == "true" ]]; then
-    if command -v podman >/dev/null 2>&1; then
-        export CONTAINER_RUNTIME="podman"
-    elif command -v docker >/dev/null 2>&1; then
-        export CONTAINER_RUNTIME="docker"
-    else
-        # Mock runtime for dry-run testing
-        export CONTAINER_RUNTIME="docker"
-        LOG_INFO_RANK0 "[container] Using mock container runtime for dry-run (no docker/podman found)"
-    fi
-    validate_container_runtime
+if command -v docker >/dev/null 2>&1; then
+    export CONTAINER_RUNTIME="docker"
+elif command -v podman >/dev/null 2>&1; then
+    export CONTAINER_RUNTIME="podman"
+else
+    LOG_ERROR "[container] Neither docker nor podman is available in PATH."
+    exit 1
+fi
```
This container-runtime validation now hard-fails when neither docker nor podman is on PATH, even in --dry-run mode. That defeats dry-run usage on login/CI hosts without container tooling. Consider keeping the previous behavior: in dry-run allow a mock/default runtime (or skip runtime validation entirely) while still validating in real runs.
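One sketch that keeps the hard failure for real runs but restores the dry-run fallback (assuming the DRY_RUN_MODE flag and logging helpers shown above):

```bash
if command -v docker >/dev/null 2>&1; then
    export CONTAINER_RUNTIME="docker"
elif command -v podman >/dev/null 2>&1; then
    export CONTAINER_RUNTIME="podman"
elif [[ "$DRY_RUN_MODE" == "true" ]]; then
    # Dry-run only: a mock runtime so command generation still works on bare hosts.
    export CONTAINER_RUNTIME="docker"
    LOG_INFO_RANK0 "[container] Using mock container runtime for dry-run (no docker/podman found)"
else
    LOG_ERROR "[container] Neither docker nor podman is available in PATH."
    exit 1
fi
```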
```diff
 # Note: Disable v3 due to accuracy issues. Will fix after TE version 2.1.
-export NVTE_CK_USES_BWD_V3=${NVTE_CK_USES_BWD_V3:-0}
+export NVTE_CK_USES_BWD_V3=${NVTE_CK_USES_BWD_V3:-1}
```
The comment says to disable v3 due to accuracy issues, but the default is now set to NVTE_CK_USES_BWD_V3:-1 (enabled). Either update the comment to match the new default rationale, or keep the default disabled (:-0) and enable only via explicit opt-in.
```python
# When pipeline parallelism (PP) is enabled, memory usage can vary across ranks.
# Therefore, we report the maximum ROCm memory usage across all ranks.
r_used_tensor = torch.tensor([r_used], device="cuda", dtype=torch.int64)
world_size = torch.distributed.get_world_size()
gathered_r_used = [torch.zeros_like(r_used_tensor) for _ in range(world_size)]
torch.distributed.all_gather(gathered_r_used, r_used_tensor)

total_r_used = [t.item() for t in gathered_r_used]
log_rank_0(f"total_r_used: {[round(r_used / 1024 ** 3, 2) for r_used in total_r_used]}")
max_r_used = max(total_r_used)
max_rank = total_r_used.index(max_r_used)
```
Collecting per-rank ROCm memory via all_gather and logging the full total_r_used list can add noticeable overhead and produce very large logs at scale. Consider using a cheaper collective (e.g., reduce/max) and only logging the aggregate (or gate the per-rank list behind a debug flag).
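A cheaper variant along the lines the comment suggests; the rank-packing trick below is illustrative and assumes world_size < 65536:

```python
import torch

# One MAX all_reduce replaces the O(world_size) all_gather; packing the rank
# into the low bits lets the same reduction also identify which rank held the max.
packed = torch.tensor(
    [r_used * 65536 + torch.distributed.get_rank()], device="cuda", dtype=torch.int64
)
torch.distributed.all_reduce(packed, op=torch.distributed.ReduceOp.MAX)
max_r_used, max_rank = divmod(int(packed.item()), 65536)
```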
```python
raise ValueError(
    f'When using recompute_layer_ids, recompute_granuarlity: {args.recompute_granularity} must be "full"'
)
```
Typo in the error message: recompute_granuarlity should be recompute_granularity (and matching spelling makes grep/debugging easier).
```bash
set -x

export HF_TOKEN="${HF_TOKEN:-'your_hf_token'}" # make it your own hf token
export WANDB_API_KEY="${WANDB_API_KEY:-'your_wandb_api_key'}" # make it your own wandb api key
```
set -x will echo all commands, including the HF/W&B token exports. This can leak credentials into logs; disable xtrace by default or only enable it after sensitive env vars are set (or guard it behind a DEBUG flag).
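A sketch of the guard (the DEBUG flag name is illustrative):

```bash
export HF_TOKEN="${HF_TOKEN:-}"            # set secrets before enabling xtrace
export WANDB_API_KEY="${WANDB_API_KEY:-}"

if [[ "${DEBUG:-0}" == "1" ]]; then        # hypothetical opt-in flag
    set -x                                 # command echo starts after secrets are set
fi
```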
| export HF_TOKEN="${HF_TOKEN:-'your_hf_token'}" # make it your own hf token | ||
| export WANDB_API_KEY="${WANDB_API_KEY:-'your_wandb_api_key'}" # make it your own wandb api key |
There was a problem hiding this comment.
The default values for HF_TOKEN/WANDB_API_KEY include single quotes inside the parameter expansion (e.g. :-'your_hf_token'), so the resulting env var value will literally contain quotes. That will break authentication unless users manually remove them; use an unquoted placeholder (or empty default) instead.
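The corresponding one-line fix, for reference:

```bash
export HF_TOKEN="${HF_TOKEN:-your_hf_token}"   # default contains no literal quotes
```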
| export HF_TOKEN="your_hf_token" # make it your own hf token | ||
| export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key | ||
| export DOCKER_IMAGE="docker.io/tasimage/primus:pr-563-ainic" |
There was a problem hiding this comment.
Hardcoding tokens in the script (export HF_TOKEN="your_hf_token", etc.) makes it easy to accidentally commit real credentials after editing. Prefer the ${VAR:-...} pattern (or require them to be provided externally) and avoid writing secrets into the repo.
```diff
 # Select and instantiate the tokenizer.
 if args.tokenizer_type in CUSTOM_TOKENIZER_TYPES:
-    tokenizer = _HuggingFaceTokenizer(args.tokenizer_model)
+    tokenizer = _HuggingFaceTokenizer(args.tokenizer_model, trust_remote_code=True)
 else:
```
Enabling trust_remote_code=True for all HuggingFace tokenizers will execute arbitrary code from remote model repos. This is a significant supply-chain risk; make this opt-in via a config/CLI flag (default False) or restrict it to the specific tokenizers that require it.
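A sketch of the opt-in gating; the `trust_remote_code` attribute on `args` is an assumed new flag, not an existing one:

```python
# Default-off: only runs that explicitly pass the flag execute remote tokenizer code.
trust_remote_code = getattr(args, "trust_remote_code", False)  # hypothetical flag
if args.tokenizer_type in CUSTOM_TOKENIZER_TYPES:
    tokenizer = _HuggingFaceTokenizer(args.tokenizer_model, trust_remote_code=trust_remote_code)
```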
```diff
 # Prevent scratch memory from being reclaimed to stabilize large memory usage
 # NOTE: Must disable scratch reclaim to avoid MoE training crash on AMD GPUs
 # Setting this to 0 prevents core dumps when using Mixture-of-Experts (MoE) models
-export HSA_NO_SCRATCH_RECLAIM=${HSA_NO_SCRATCH_RECLAIM:-0}
+export HSA_NO_SCRATCH_RECLAIM=${HSA_NO_SCRATCH_RECLAIM:-1}
```
The comment says “Setting this to 0 prevents core dumps…”, but the default is now HSA_NO_SCRATCH_RECLAIM:-1. Either the comment is now incorrect or the default should remain 0; please align the comment with the actual behavior to avoid misconfiguration.
| export HF_TOKEN="${HF_TOKEN:-'your_hf_token'}" # make it your own hf token | ||
| export WANDB_API_KEY="${WANDB_API_KEY:-'your_wandb_api_key'}" # make it your own wandb api key |
There was a problem hiding this comment.
The default values for HF_TOKEN/WANDB_API_KEY include single quotes inside the parameter expansion, so the env var value will literally contain quotes. This will break authentication unless edited; use an unquoted placeholder (or empty default) instead.
```diff
 # Note: Disable v3 due to accuracy issues. Will fix after TE version 2.1.
-export NVTE_CK_USES_BWD_V3=${NVTE_CK_USES_BWD_V3:-0}
+export NVTE_CK_USES_BWD_V3=${NVTE_CK_USES_BWD_V3:-1}
```
The comment states “Disable v3 due to accuracy issues”, but the default was changed to NVTE_CK_USES_BWD_V3:-1, which enables it. Please either revert the default or update the comment to reflect the intended default/risks.
Add training configurations for the Minimax 2.5 and GLAM5 models, including single-node and 8-node/16-node training scripts. Co-authored-by: botaohu001 <Botao.Hu@amd.com>
```python
def new_post_init(self):
    tmp = getattr(self, "recompute_granularity", None)
    self.recompute_granularity = None
    orig_post_init(self)
    self.recompute_granularity = tmp
    validate_specified_recompute_layers(config_mod.TransformerConfig, args)
```
In new_post_init(), validate_specified_recompute_layers() is called with config_mod.TransformerConfig (the class) instead of the config instance (self). This will validate/mutate class attributes rather than the actual run’s config and can raise attribute errors or silently ignore the user-provided recompute_layer_ids. Pass the instance (self) into validate_specified_recompute_layers so it validates the effective config for this run.
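The minimal fix as described, sketched against the hunk above:

```python
def new_post_init(self):
    tmp = getattr(self, "recompute_granularity", None)
    self.recompute_granularity = None
    orig_post_init(self)
    self.recompute_granularity = tmp
    # Validate the live config instance, not the TransformerConfig class object.
    validate_specified_recompute_layers(self, args)
```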
```python
if self.config.packing and self.config.dataset_type != "synthetic":
    if decoder_segment_ids is None:
        decoder_segment_ids = jnp.ones(shape=query.shape[:2], dtype=jnp.int32)
    attn_mask = SequenceDescriptor.from_segment_ids_and_pos(
        segment_ids=decoder_segment_ids, segment_pos=None
    )
    qkv_layout = "THD_THD_THD"  # 'T3HD', 'THD_T2HD' or 'THD_THD_THD'
```
PrimusAttentionOp.cudnn_flash_attention references SequenceDescriptor but it is never imported in this module, which will raise NameError when packing is enabled. Import SequenceDescriptor from the appropriate MaxText/TE module (matching the legacy MaxText API) or avoid referencing it here.
```diff
 # Validate container runtime (docker/podman)
-# In dry-run mode, allow mock runtime if docker/podman not available
-if [[ "$DRY_RUN_MODE" == "true" ]]; then
-    if command -v podman >/dev/null 2>&1; then
-        export CONTAINER_RUNTIME="podman"
-    elif command -v docker >/dev/null 2>&1; then
-        export CONTAINER_RUNTIME="docker"
-    else
-        # Mock runtime for dry-run testing
-        export CONTAINER_RUNTIME="docker"
-        LOG_INFO_RANK0 "[container] Using mock container runtime for dry-run (no docker/podman found)"
-    fi
+if command -v docker >/dev/null 2>&1; then
+    export CONTAINER_RUNTIME="docker"
+elif command -v podman >/dev/null 2>&1; then
+    export CONTAINER_RUNTIME="podman"
```
Container runtime detection now hard-fails if neither docker nor podman is in PATH. This also blocks --dry-run mode, which previously could be used for command-generation/testing on systems without a container runtime. Consider skipping runtime validation (or keeping the prior mock-runtime fallback) when DRY_RUN_MODE=true.
```diff
 except SystemExit:
     raise
 except Exception:
     # Torchrun/elastic can sometimes suppress worker tracebacks.
     # Print here so users can see the real root cause.
-    if getattr(args, "debug", False):
-        traceback.print_exc()
-    else:
-        print(f"[Primus] Error: {traceback.format_exc().splitlines()[-1]}", file=sys.stderr)
-        print("[Primus] Re-run with --debug for full stack trace.", file=sys.stderr)
+    traceback.print_exc()
```
The CLI now always prints a full traceback (traceback.print_exc()) on any exception, regardless of the --debug flag. This makes --debug ineffective and can produce extremely noisy logs (especially under torchrun/elastic). Restore the prior behavior: print full traceback only when --debug is set, otherwise emit a concise error with a hint to re-run with --debug.
| export HF_TOKEN="${HF_TOKEN:-'your_hf_token'}" # make it your own hf token | ||
| export WANDB_API_KEY="${WANDB_API_KEY:-'your_wandb_api_key'}" # make it your own wandb api key |
There was a problem hiding this comment.
HF_TOKEN/WANDB_API_KEY defaults include single quotes inside the parameter expansion (e.g., ${HF_TOKEN:-'your_hf_token'}), which will literally set the env var value to a quoted string when unset. Remove the inner single quotes so the default is a usable placeholder value.
```bash
export PRIMUS_EXP_NAME=minimax_m2.5-pretrain-platform_$PLATFORM-layers_$PRIMUS_TOTAL_LAYERS-type_$PRETRAIN_TYPE-mbs_$MBS-gbs_$GBS-legacygg_$LEGACY_GG
export PRIMUS_EXP_NAME=debug_minimax_m2.5_4layers-type_$PRETRAIN_TYPE-legacygg_$LEGACY_GG-turbogg_$TURBO_GROUPED_MLP
```
PRIMUS_EXP_NAME is assigned twice; the second assignment overwrites the first and forces a debug name. If this is meant to be optional, gate it behind a flag or comment it out; otherwise remove the redundant assignment so runs get the expected experiment naming/output directory.
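A sketch of gating the debug name behind a flag (`DEBUG_RUN` is an illustrative variable):

```bash
export PRIMUS_EXP_NAME=minimax_m2.5-pretrain-platform_$PLATFORM-layers_$PRIMUS_TOTAL_LAYERS-type_$PRETRAIN_TYPE-mbs_$MBS-gbs_$GBS-legacygg_$LEGACY_GG
if [ "${DEBUG_RUN:-0}" = "1" ]; then   # hypothetical opt-in
    export PRIMUS_EXP_NAME=debug_minimax_m2.5_4layers-type_$PRETRAIN_TYPE-legacygg_$LEGACY_GG-turbogg_$TURBO_GROUPED_MLP
fi
```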
```bash
export PRIMUS_EXP_NAME=dsv3-pretrain-platform_$PLATFORM-layers_$PRIMUS_TOTAL_LAYERS-type_$PRETRAIN_TYPE-mbs_$MBS-gbs_$GBS-legacygg_$LEGACY_GG
export PRIMUS_EXP_NAME=debug_4layers-type_$PRETRAIN_TYPE-legacygg_$LEGACY_GG-turbogg_$TURBO_GROUPED_MLP
```
PRIMUS_EXP_NAME is assigned twice; the second assignment overwrites the first and forces a debug name. If this is meant to be optional, gate it behind a flag or comment it out; otherwise remove the redundant assignment so runs get the expected experiment naming/output directory.
| elif [ "$PLATFORM" = "B200" ] || [ "$PLATFORM" = "GB200" ]; then | ||
| export DOCKER_IMAGE="nvcr.io/nvidia/nemo:25.09" | ||
| EXTRA_ARGS="--use_rocm_mem_info_iters None" | ||
| else |
There was a problem hiding this comment.
EXTRA_ARGS is assigned as a single string containing a space ("--use_rocm_mem_info_iters None") and later passed as "${EXTRA_ARGS}". When quoted, this becomes one CLI argument (with an embedded space) instead of two separate arguments, which will likely break argparse parsing. Use an array for EXTRA_ARGS (and expand it as "${EXTRA_ARGS[@]}"), or avoid quoting and ensure correct word-splitting.
| elif [ "$PLATFORM" = "B200" ] || [ "$PLATFORM" = "GB200" ]; then | ||
| export DOCKER_IMAGE="nvcr.io/nvidia/nemo:25.09" | ||
| EXTRA_ARGS="--use_rocm_mem_info_iters None" | ||
| else |
There was a problem hiding this comment.
EXTRA_ARGS is assigned as a single string containing a space ("--use_rocm_mem_info_iters None") and later passed as "${EXTRA_ARGS}". When quoted, this becomes one CLI argument (with an embedded space) instead of two separate arguments, which will likely break argparse parsing. Use an array for EXTRA_ARGS (and expand it as "${EXTRA_ARGS[@]}"), or avoid quoting and ensure correct word-splitting.