Conversation
…ice correctly for CUDA context creation / hypothetical memory allocations (NVIDIA#2710) Co-authored-by: Deepak Narayanan <dnarayanan@nvidia.com>
…NVIDIA#2723) Signed-off-by: John St John <jstjohn@nvidia.com> Signed-off-by: John St. John <jstjohn@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
…#2852) Signed-off-by: John St. John <jstjohn@nvidia.com>
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Co-authored-by: root <root@gpu-h100-0348.cm.cluster> Co-authored-by: root <root@gpu-h100-0193.cm.cluster> Co-authored-by: root <root@gpu-h100-0082.cm.cluster> Co-authored-by: root <root@gpu-h100-0495.cm.cluster> Co-authored-by: William Dykas <wdykas@cw-pdx-cs-001-vscode-02.cm.cluster> Co-authored-by: root <root@gpu-h100-0213.cm.cluster> Co-authored-by: root <root@gpu-h100-0435.cm.cluster> Co-authored-by: root <root@gpu-h100-0188.cm.cluster> Co-authored-by: root <root@gpu-h100-0032.cm.cluster> Co-authored-by: root <root@gpu-h100-0023.cm.cluster> Co-authored-by: root <root@gpu-h100-0368.cm.cluster> Co-authored-by: root <root@gpu-h100-0203.cm.cluster> Co-authored-by: root <root@gpu-h100-0229.cm.cluster> Co-authored-by: root <root@gpu-h100-0123.cm.cluster> Co-authored-by: root <root@gpu-h100-0217.cm.cluster> Co-authored-by: root <root@gpu-h100-0496.cm.cluster> Co-authored-by: root <root@gpu-h100-0022.cm.cluster> Co-authored-by: root <root@gpu-h100-0176.cm.cluster> Co-authored-by: root <root@gpu-h100-0190.cm.cluster>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Robin Zhang <robinz@nvidia.com>
…NVIDIA#2751) Co-authored-by: Philip Petrakian <pgpetrak@gmail.com>
…VIDIA#2738)" (NVIDIA#2884) Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: kunlunl <kunlunl@nvidia.com> Signed-off-by: jianbinc <shjwudp@gmail.com> Co-authored-by: jianbinc <shjwudp@gmail.com> Co-authored-by: Cory Ye <44509866+cspades@users.noreply.github.com>
…A#2794) Co-authored-by: Xin Yao <xiny@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com> Co-authored-by: shanmugamr1992 <shanmugamr1992@gmail.com> Co-authored-by: Shanmugam Ramasamy <shanmugamr@cw-dfw-cs-001-login-01.cm.cluster>
Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
This reverts commit ffbc43f.
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>
Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com> Co-authored-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: Rabeeh Mahabadi <rkarimimahab@nb-hel-cs-001-vscode-02.cm.cluster> Co-authored-by: Sanjeev Satheesh <sasatheesh@nvidia.com> Co-authored-by: Deepak Narayanan <dnarayanan@nvidia.com>
Signed-off-by: Santosh Bhavani <santosh.bhavani@live.com> Co-authored-by: Xin Yao <xiny@nvidia.com>
  run: |
    set -e
-   BRANCH=release_v2.2_rocm
+   BRANCH=release_v2.10_rocm
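After the bump, the step presumably reduces to something like this (hypothetical reconstruction; the `echo` is only for illustration and is not in the workflow):

```shell
set -e                        # abort the CI step on the first failing command
BRANCH=release_v2.10_rocm     # bumped from release_v2.2_rocm
echo "Using branch: ${BRANCH}"
```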
Some of these changes were already merged in a previous PR.
  push: true
  load: true
  # Also write image to disk so `docker run` does not need `docker pull`.
  # On some runners dockerd cannot reach auth.docker.io (timeout); BuildKit push may still work.
We dropped `push` + `load: true` because the combination was flaky against Docker Hub. We still push for the registry/cache, and we feed the same build into `docker load` via a local tar, so `docker run` does not need `docker pull` or `load: true`.
buildx failed with: ERROR: failed to build: failed to solve: error writing layer blob: failed to copy: unexpected status from PUT request to https://registry-1.docker.io/v2/rocm/megatron-lm-private/blobs/uploads/c2e050f3-4e70-4eec-bb23-08dfae343b0a?_state=Vhe_1OUGFU_y2OwxQAbuf73Pp7sRjmlL9_A1IGaNHgh7Ik5hbWUiOiJyb2NtL21lZ2F0cm9uLWxtLXByaXZhdGUiLCJVVUlEIjoiYzJlMDUwZjMtNGU3MC00ZWVjLWJiMjMtMDhkZmFlMzQzYjBhIiwiT2Zmc2V0IjowLCJTdGFydGVkQXQiOiIyMDI2LTAzLTE5VDIyOjUxOjAzLjY0MTczNjM2OVoifQ%3D%3D&digest=sha256%3A58ddf34583f8113d2742fc9b68af4b3336d0059e8dc77172f7808396273344ee: 400 Bad request"
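A hedged sketch of that workaround, assuming a GitHub Actions step driving `docker buildx` directly (the tag variable and tar path are illustrative placeholders, not the workflow's actual values):

```yaml
- name: Build, load, and push image
  run: |
    set -e
    # Write the built image to a local tar instead of relying on `load: true`.
    docker buildx build --output type=docker,dest=/tmp/image.tar -t "$IMAGE_TAG" .
    # Make the image available to `docker run` without a `docker pull`.
    docker load --input /tmp/image.tar
    # Still push so the registry/cache stays warm; this can fail independently
    # of the local load, which is the point of separating the two.
    docker push "$IMAGE_TAG"
```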
And FSDP-v2 is not supported with pipeline parallelism, expert parallelism, MCore's
distributed optimizer, gradient accumulation fusion and fp16.

To run with Megatron-LM HSDP enabled (Hybrid Sharded Data Parallel), use `ENABLE_HSDP=1`
The HSDP changes are from previous PRs on rocm_dev.
@@ -1,3 +1,6 @@
# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
All the changes in this file are from upstream.
Upstream runs the test in this file together with test_mfsdp_fully_shard.py, but in our case corrupted state between the two tests caused failures, so they were separated into two files in the core_0.15.0 IFU.
), "Expected no log on rank != 0 for experimental fn enable message"
else:

if safe_get_rank() == 0:
@pytest.mark.parametrize("tp_size", [2])
@pytest.mark.parametrize("dp_overlap", [(True, True)])
@pytest.mark.skipif(not cuda_graph_supported, reason=reason_for_no_cuda_graph)
@pytest.mark.failing_on_rocm(reason="CUDA graph capture bug on ROCm 7.1 https://github.com/ROCm/rccl/issues/2022")
Next steps after the IFU:
1. Upgrade CI to ROCm 7.2.
2. Fix this test.
git checkout hipcub_warp_threads_deprecation &&\
git show --oneline -s &&\
pip install --no-build-isolation --no-deps .
RUN pip install mamba-ssm --no-build-isolation
No need to use our custom Mamba repo anymore; the latest upstream repo has the ROCm 7.0 fix!
pip install --no-build-isolation .

RUN git clone https://github.com/NVIDIA-NeMo/Emerging-Optimizers.git &&\
Works out of the box, and all the unit tests using this repo pass. It brings in some custom optimizers such as Muon.
if Utils.rank == 0:
    with TempNamedDir(tmp_dir, sync=False):
yield tmp_dir
if torch.distributed.is_initialized():
Without barriers, rank 0 deletes the tmp dir while other ranks are still working in it.
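The race can be illustrated without GPUs or torch: if "rank 0" removes a shared directory before the other "ranks" finish, they fail. A minimal single-machine sketch, with threads standing in for ranks and `threading.Barrier` standing in for `torch.distributed.barrier()` (names are illustrative, not the project's API):

```python
import os
import pathlib
import tempfile
import threading

NUM_RANKS = 4
barrier = threading.Barrier(NUM_RANKS)  # stands in for torch.distributed.barrier()
tmp = tempfile.mkdtemp()

def worker(rank: int) -> None:
    # Every "rank" writes into the shared directory.
    path = pathlib.Path(tmp) / f"rank{rank}.txt"
    path.write_text("ok")
    # Barrier BEFORE cleanup: rank 0 must not delete the directory
    # until every rank has finished using it.
    barrier.wait()
    if rank == 0:
        for f in pathlib.Path(tmp).iterdir():
            f.unlink()
        os.rmdir(tmp)

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_RANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(os.path.exists(tmp))  # False: cleanup only ran after all ranks passed the barrier
```

Dropping the `barrier.wait()` recreates the bug: rank 0 may `rmdir` while another thread is still writing its file.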
  # now weakref everything
- if HAVE_TE_GRAPHS:
+ if HAVE_TE_WEAK_REF:
To fix this error coming from TE:
E NameError: name 'make_weak_ref' is not defined
>   return _CudagraphGlobalRecord.create_cudagraphs()
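The underlying pattern is a guarded optional import: the guard flag must track the exact symbol used later, or a partially available dependency yields a NameError at call time instead of a clean fallback. A sketch under stated assumptions (the TE import path and the fallback are illustrative guesses, not the project's real code; the `except` makes the guess harmless if the module is absent):

```python
import weakref

# Guard flag tied to the specific symbol we will call, not a broader feature flag.
try:
    from transformer_engine.pytorch import make_weak_ref  # hypothetical location
    HAVE_TE_WEAK_REF = True
except ImportError:
    HAVE_TE_WEAK_REF = False

class GraphInput:
    """Stand-in for a tensor held alive by a captured CUDA graph."""

def to_weak_refs(inputs):
    # Use TE's helper only when it actually imported; otherwise fall back
    # to plain weakrefs (illustrative only, not the project's real fallback).
    if HAVE_TE_WEAK_REF:
        return [make_weak_ref(x) for x in inputs]
    return [weakref.ref(x) for x in inputs]

inputs = [GraphInput()]
refs = to_weak_refs(inputs)
print(refs[0]() is inputs[0])  # True while the original object is still alive
```

Guarding on a coarser flag (here, something like `HAVE_TE_GRAPHS`) while calling the finer-grained symbol is exactly how the NameError above arises.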
Motivation
To fetch the latest changes from upstream Megatron-LM.
Commit: 'f4502eb1c92f77e0ed190cac00293e8ac192543b'
Test Result
Result: https://github.com/ROCm/Megatron-LM/pull/119/checks?check_run_id=69965357451
core_r0.15.0 result: https://github.com/ROCm/Megatron-LM/actions/runs/23548284464/job/68585244517
Submission Checklist