
Sudhu/megatron ifu core 0.16.0 sync #119

Open
sudhu2k wants to merge 696 commits into rocm_dev from sudhu/Megatron-IFU-core_0.16.0_sync

Conversation


@sudhu2k sudhu2k commented Mar 3, 2026

Motivation

To fetch the latest changes from upstream Megatron-LM.
Commit: `f4502eb1c92f77e0ed190cac00293e8ac192543b`

Test Result

Result: https://github.com/ROCm/Megatron-LM/pull/119/checks?check_run_id=69965357451
core_r0.15.0 result: https://github.com/ROCm/Megatron-LM/actions/runs/23548284464/job/68585244517

Submission Checklist

Phlip79 and others added 30 commits January 6, 2026 23:14
…ice correctly for CUDA context creation / hypothetical memory allocations (NVIDIA#2710)

Co-authored-by: Deepak Narayanan <dnarayanan@nvidia.com>
…NVIDIA#2723)

Signed-off-by: John St John <jstjohn@nvidia.com>
Signed-off-by: John St. John <jstjohn@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Co-authored-by: root <root@gpu-h100-0348.cm.cluster>
Co-authored-by: root <root@gpu-h100-0193.cm.cluster>
Co-authored-by: root <root@gpu-h100-0082.cm.cluster>
Co-authored-by: root <root@gpu-h100-0495.cm.cluster>
Co-authored-by: William Dykas <wdykas@cw-pdx-cs-001-vscode-02.cm.cluster>
Co-authored-by: root <root@gpu-h100-0213.cm.cluster>
Co-authored-by: root <root@gpu-h100-0435.cm.cluster>
Co-authored-by: root <root@gpu-h100-0188.cm.cluster>
Co-authored-by: root <root@gpu-h100-0032.cm.cluster>
Co-authored-by: root <root@gpu-h100-0023.cm.cluster>
Co-authored-by: root <root@gpu-h100-0368.cm.cluster>
Co-authored-by: root <root@gpu-h100-0203.cm.cluster>
Co-authored-by: root <root@gpu-h100-0229.cm.cluster>
Co-authored-by: root <root@gpu-h100-0123.cm.cluster>
Co-authored-by: root <root@gpu-h100-0217.cm.cluster>
Co-authored-by: root <root@gpu-h100-0496.cm.cluster>
Co-authored-by: root <root@gpu-h100-0022.cm.cluster>
Co-authored-by: root <root@gpu-h100-0176.cm.cluster>
Co-authored-by: root <root@gpu-h100-0190.cm.cluster>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Signed-off-by: kunlunl <kunlunl@nvidia.com>
Signed-off-by: jianbinc <shjwudp@gmail.com>
Co-authored-by: jianbinc <shjwudp@gmail.com>
Co-authored-by: Cory Ye <44509866+cspades@users.noreply.github.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
Co-authored-by: shanmugamr1992 <shanmugamr1992@gmail.com>
Co-authored-by: Shanmugam Ramasamy <shanmugamr@cw-dfw-cs-001-login-01.cm.cluster>
santhnm2 and others added 9 commits January 31, 2026 08:57
Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>
Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: Rabeeh Mahabadi <rkarimimahab@nb-hel-cs-001-vscode-02.cm.cluster>
Co-authored-by: Sanjeev Satheesh <sasatheesh@nvidia.com>
Co-authored-by: Deepak Narayanan <dnarayanan@nvidia.com>
Signed-off-by: Santosh Bhavani <santosh.bhavani@live.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
@github-actions

This PR is stale because it has been open for 14 days with no activity. Remove the stale label or update the PR, or it will be closed in 7 days.

@github-actions github-actions bot added the stale label Mar 18, 2026
@sudhu2k sudhu2k self-assigned this Mar 18, 2026
@github-actions github-actions bot removed the stale label Mar 19, 2026
@sudhu2k sudhu2k force-pushed the sudhu/Megatron-IFU-core_0.16.0_sync branch from b982cb0 to fc49325 Compare April 3, 2026 19:54
@sudhu2k sudhu2k force-pushed the sudhu/Megatron-IFU-core_0.16.0_sync branch from a192ae0 to eb2afc7 Compare April 4, 2026 17:56
@sudhu2k sudhu2k requested a review from wenchenvincent April 4, 2026 22:29
@sudhu2k sudhu2k marked this pull request as ready for review April 4, 2026 22:31
@sudhu2k sudhu2k changed the title [WIP] Sudhu/megatron ifu core 0.16.0 sync Sudhu/megatron ifu core 0.16.0 sync Apr 4, 2026
run: |
  set -e
-  BRANCH=release_v2.2_rocm
+  BRANCH=release_v2.10_rocm
Collaborator Author

Some of the changes were already merged in a previous PR.

push: true
load: true
# Also write image to disk so `docker run` does not need `docker pull`.
# On some runners dockerd cannot reach auth.docker.io (timeout); BuildKit push may still work.
Collaborator Author

We dropped the `push` + `load: true` combination because it was flaky against Docker Hub. We still push for the registry/cache, and we feed the same build into `docker load` via a local tar, so `docker run` needs neither `docker pull` nor `load: true`.
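For context, a minimal sketch of how a step like that could be wired in a GitHub Actions workflow. The flags, `$IMAGE_TAG`, and the tar path are illustrative assumptions, not the actual workflow; emitting both a registry push and a local tar in one build needs a recent buildx/BuildKit with multi-exporter support:

```yaml
- name: Build, push, and save image
  run: |
    # --push uploads to the registry (cache reuse); the docker exporter
    # also writes the same image to a local tar on the runner.
    docker buildx build \
      --push \
      --output type=docker,dest=/tmp/image.tar \
      -t "$IMAGE_TAG" -f Dockerfile_rocm.ci .
- name: Load image so `docker run` needs no pull
  run: docker load -i /tmp/image.tar
```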

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread examples/llama/README.md
And FSDP-v2 is not supported with pipeline parallelism, expert parallelism, MCore's
distributed optimizer, gradient accumulation fusion and fp16.

To run with Megatron-LM HSDP enabled (Hybrid Sharded Data Parallel), use `ENABLE_HSDP=1`
Collaborator Author

The HSDP changes are from previous PRs on rocm_dev.

@@ -1,3 +1,6 @@
# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Collaborator Author

All the changes for this file are from upstream.
Upstream runs the test in this file together with test_mfsdp_fully_shard.py, but in our case corrupted state shared between the two tests caused failures, so they were split into two files in the core_0.15.0 IFU.

), "Expected no log on rank != 0 for experimental fn enable message"
else:

if safe_get_rank() == 0:
Collaborator Author

Cleaned up.

@pytest.mark.parametrize("tp_size", [2])
@pytest.mark.parametrize("dp_overlap", [(True, True)])
@pytest.mark.skipif(not cuda_graph_supported, reason=reason_for_no_cuda_graph)
@pytest.mark.failing_on_rocm(reason="CUDA graph capture bug on ROCm 7.1 https://github.com/ROCm/rccl/issues/2022")
Collaborator Author

Next steps after the IFU:
- Upgrade CI to ROCm 7.2.
- Fix this test.

Comment thread Dockerfile_rocm.ci
git checkout hipcub_warp_threads_deprecation &&\
git show --oneline -s &&\
pip install --no-build-isolation --no-deps .
RUN pip install mamba-ssm --no-build-isolation
Collaborator Author
@sudhu2k sudhu2k Apr 6, 2026

No need to use our custom Mamba repo anymore; the latest upstream repo has the ROCm 7.0 fix!

Comment thread Dockerfile_rocm.ci
pip install --no-build-isolation .


RUN git clone https://github.com/NVIDIA-NeMo/Emerging-Optimizers.git &&\
Collaborator Author

Works out of the box, and all the unit tests using this repo pass. It brings in custom optimizers such as Muon.

if Utils.rank == 0:
with TempNamedDir(tmp_dir, sync=False):
yield tmp_dir
if torch.distributed.is_initialized():
Collaborator Author

Without barriers, rank0 deletes the tmp dir while other ranks are still working on it.


# now weakref everything
-if HAVE_TE_GRAPHS:
+if HAVE_TE_WEAK_REF:
Collaborator Author

To fix this NameError coming from TE:
E NameError: name 'make_weak_ref' is not defined
raised at `return _CudagraphGlobalRecord.create_cudagraphs()`.
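A sketch of the feature probe that a `HAVE_TE_WEAK_REF` guard implies. The exact Transformer Engine module path is an assumption (it varies by TE version); the point is the try/except-import pattern used for optional dependencies, so the weakref path is only taken when the symbol actually exists:

```python
# Probe for make_weak_ref; the import path below is illustrative and
# may differ between Transformer Engine releases.
try:
    from transformer_engine.pytorch.graph import make_weak_ref  # noqa: F401

    HAVE_TE_WEAK_REF = True
except ImportError:
    HAVE_TE_WEAK_REF = False
```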

