
Sudhu/megatron ifu core 0.16.0 sync #119

Open
sudhu2k wants to merge 696 commits into rocm_dev from sudhu/Megatron-IFU-core_0.16.0_sync

Conversation


@sudhu2k sudhu2k commented Mar 3, 2026

Motivation

To fetch the latest changes from upstream Megatron-LM.
Commit: `f4502eb1c92f77e0ed190cac00293e8ac192543b`

Test Result

Result: https://github.com/ROCm/Megatron-LM/pull/119/checks?check_run_id=69965357451
core_r0.15.0 result: https://github.com/ROCm/Megatron-LM/actions/runs/23548284464/job/68585244517

Submission Checklist

Phlip79 and others added 30 commits January 6, 2026 23:14
…ice correctly for CUDA context creation / hypothetical memory allocations (NVIDIA#2710)

Co-authored-by: Deepak Narayanan <dnarayanan@nvidia.com>
…NVIDIA#2723)

Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: John St. John <jstjohn@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Co-authored-by: root <root@gpu-h100-0348.cm.cluster>
Co-authored-by: root <root@gpu-h100-0193.cm.cluster>
Co-authored-by: root <root@gpu-h100-0082.cm.cluster>
Co-authored-by: root <root@gpu-h100-0495.cm.cluster>
Co-authored-by: William Dykas <wdykas@cw-pdx-cs-001-vscode-02.cm.cluster>
Co-authored-by: root <root@gpu-h100-0213.cm.cluster>
Co-authored-by: root <root@gpu-h100-0435.cm.cluster>
Co-authored-by: root <root@gpu-h100-0188.cm.cluster>
Co-authored-by: root <root@gpu-h100-0032.cm.cluster>
Co-authored-by: root <root@gpu-h100-0023.cm.cluster>
Co-authored-by: root <root@gpu-h100-0368.cm.cluster>
Co-authored-by: root <root@gpu-h100-0203.cm.cluster>
Co-authored-by: root <root@gpu-h100-0229.cm.cluster>
Co-authored-by: root <root@gpu-h100-0123.cm.cluster>
Co-authored-by: root <root@gpu-h100-0217.cm.cluster>
Co-authored-by: root <root@gpu-h100-0496.cm.cluster>
Co-authored-by: root <root@gpu-h100-0022.cm.cluster>
Co-authored-by: root <root@gpu-h100-0176.cm.cluster>
Co-authored-by: root <root@gpu-h100-0190.cm.cluster>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Signed-off-by: kunlunl <kunlunl@nvidia.com>
Signed-off-by: jianbinc <shjwudp@gmail.com>
Co-authored-by: jianbinc <shjwudp@gmail.com>
Co-authored-by: Cory Ye <44509866+cspades@users.noreply.github.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
Co-authored-by: shanmugamr1992 <shanmugamr1992@gmail.com>
Co-authored-by: Shanmugam Ramasamy <shanmugamr@cw-dfw-cs-001-login-01.cm.cluster>
santhnm2 and others added 9 commits January 31, 2026 08:57
Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>
Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: Rabeeh Mahabadi <rkarimimahab@nb-hel-cs-001-vscode-02.cm.cluster>
Co-authored-by: Sanjeev Satheesh <sasatheesh@nvidia.com>
Co-authored-by: Deepak Narayanan <dnarayanan@nvidia.com>
Signed-off-by: Santosh Bhavani <santosh.bhavani@live.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
@github-actions

This PR is stale because it has been open for 14 days with no activity. Remove the stale label or update the PR, or it will be closed in 7 days.

@github-actions github-actions bot added the stale label Mar 18, 2026
@sudhu2k sudhu2k self-assigned this Mar 18, 2026
@github-actions github-actions bot removed the stale label Mar 19, 2026
@sudhu2k sudhu2k force-pushed the sudhu/Megatron-IFU-core_0.16.0_sync branch from b982cb0 to fc49325 Compare April 3, 2026 19:54
@sudhu2k sudhu2k force-pushed the sudhu/Megatron-IFU-core_0.16.0_sync branch from a192ae0 to eb2afc7 Compare April 4, 2026 17:56
@sudhu2k sudhu2k requested a review from wenchenvincent April 4, 2026 22:29
@sudhu2k sudhu2k marked this pull request as ready for review April 4, 2026 22:31
@sudhu2k sudhu2k changed the title [WIP] Sudhu/megatron ifu core 0.16.0 sync Sudhu/megatron ifu core 0.16.0 sync Apr 4, 2026
run: |
  set -e
-  BRANCH=release_v2.2_rocm
+  BRANCH=release_v2.10_rocm
Collaborator Author

Some of the changes were already merged in a previous PR.

push: true
load: true
# Also write image to disk so `docker run` does not need `docker pull`.
# On some runners dockerd cannot reach auth.docker.io (timeout); BuildKit push may still work.
Collaborator Author

We dropped the `push` + `load: true` combination because it was flaky against Docker Hub. We still push for the registry/cache, and we feed the same build into `docker load` via a local tar, so `docker run` needs neither `docker pull` nor `load: true`.
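For context, a minimal sketch of how a step like that could be wired in a GitHub Actions workflow. The flags, `$IMAGE_TAG`, and the tar path are illustrative assumptions, not the actual workflow; emitting both a registry push and a local tar in one build needs a recent buildx/BuildKit with multi-exporter support:

```yaml
- name: Build, push, and save image
  run: |
    # --push uploads to the registry (cache reuse); the docker exporter
    # also writes the same image to a local tar on the runner.
    docker buildx build \
      --push \
      --output type=docker,dest=/tmp/image.tar \
      -t "$IMAGE_TAG" -f Dockerfile_rocm.ci .
- name: Load image so `docker run` needs no pull
  run: docker load -i /tmp/image.tar
```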

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread examples/llama/README.md
And FSDP-v2 is not supported with pipeline parallelism, expert parallelism, MCore's
distributed optimizer, gradient accumulation fusion and fp16.

To run with Megatron-LM HSDP enabled (Hybrid Sharded Data Parallel), use `ENABLE_HSDP=1`
Collaborator Author

The HSDP changes are from previous PRs on rocm_dev.

@@ -1,3 +1,6 @@
# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Collaborator Author

All the changes for this file are from upstream.
Upstream runs the test in this file together with test_mfsdp_fully_shard.py, but in our case corrupted state shared between the two tests caused failures, so they were split into two files in the core_0.15.0 IFU.

), "Expected no log on rank != 0 for experimental fn enable message"
else:

if safe_get_rank() == 0:
Collaborator Author

Cleaned up.

@pytest.mark.parametrize("tp_size", [2])
@pytest.mark.parametrize("dp_overlap", [(True, True)])
@pytest.mark.skipif(not cuda_graph_supported, reason=reason_for_no_cuda_graph)
@pytest.mark.failing_on_rocm(reason="CUDA graph capture bug on ROCm 7.1 https://github.com/ROCm/rccl/issues/2022")
Collaborator Author

Next steps after the IFU:
- Upgrade CI to ROCm 7.2.
- Fix this test.

Comment thread Dockerfile_rocm.ci
git checkout hipcub_warp_threads_deprecation &&\
git show --oneline -s &&\
pip install --no-build-isolation --no-deps .
RUN pip install mamba-ssm --no-build-isolation
Collaborator Author
@sudhu2k sudhu2k Apr 6, 2026

No need to use our custom Mamba repo anymore; the latest upstream repo has the ROCm 7.0 fix!

Comment thread Dockerfile_rocm.ci
pip install --no-build-isolation .


RUN git clone https://github.com/NVIDIA-NeMo/Emerging-Optimizers.git &&\
Collaborator Author

Works out of the box, and all the unit tests using this repo pass. It brings in custom optimizers such as Muon.

if Utils.rank == 0:
with TempNamedDir(tmp_dir, sync=False):
yield tmp_dir
if torch.distributed.is_initialized():
Collaborator Author

Without barriers, rank0 deletes the tmp dir while other ranks are still working on it.


# now weakref everything
-if HAVE_TE_GRAPHS:
+if HAVE_TE_WEAK_REF:
Collaborator Author

To fix this NameError coming from TE:
E NameError: name 'make_weak_ref' is not defined
raised at `return _CudagraphGlobalRecord.create_cudagraphs()`.
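A sketch of the feature probe that a `HAVE_TE_WEAK_REF` guard implies. The exact Transformer Engine module path is an assumption (it varies by TE version); the point is the try/except-import pattern used for optional dependencies, so the weakref path is only taken when the symbol actually exists:

```python
# Probe for make_weak_ref; the import path below is illustrative and
# may differ between Transformer Engine releases.
try:
    from transformer_engine.pytorch.graph import make_weak_ref  # noqa: F401

    HAVE_TE_WEAK_REF = True
except ImportError:
    HAVE_TE_WEAK_REF = False
```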

