Improve GPU defaults, caching, and pointwise scheduling #4668
Open
Rolaand-Jayz wants to merge 13 commits into ROCm:develop from
added 9 commits
March 13, 2026 09:30
Pull request overview
This PR updates MIGraphX’s GPU target defaults and hot paths to improve performance on modern AMD GPUs (notably RDNA). It adds architecture-aware heuristics, caches repeated GPU compilation and solver work, and tunes scheduler and pointwise launch behavior.
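To illustrate the kind of in-process caching described above, here is a minimal sketch of a memoizing compile cache keyed on source text and compiler flags. The class name `compile_cache`, its `get_or_compile` method, and the key layout are all hypothetical and are not taken from the PR's code; the actual change may key and store entries differently.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch, not MIGraphX's implementation: an in-process cache
// that avoids recompiling the same (source, flags) pair.
class compile_cache
{
    public:
    using binary     = std::vector<char>;
    using key        = std::pair<std::string, std::string>; // (source, flags)
    using compile_fn = std::function<binary(const std::string&, const std::string&)>;

    // Returns the cached binary, invoking `compile` only on a cache miss.
    // The lock is held during compilation, so concurrent misses are
    // serialized; a production cache may prefer finer-grained locking.
    binary get_or_compile(const std::string& source,
                          const std::string& flags,
                          const compile_fn& compile)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        key k{source, flags};
        auto it = entries_.find(k);
        if(it != entries_.end())
            return it->second;
        binary result = compile(source, flags);
        entries_.emplace(std::move(k), result);
        return result;
    }

    // Number of distinct (source, flags) pairs compiled so far.
    std::size_t size() const
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return entries_.size();
    }

    private:
    mutable std::mutex mutex_;
    std::map<key, binary> entries_;
};
```

Calling `get_or_compile` twice with the same source and flags invokes the compile callback once; changing either argument produces a separate entry.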
Changes:
- Add adaptive GPU stream-count defaults (based on compute unit count) and document the new `MIGRAPHX_NSTREAMS=0` “adaptive” behavior.
- Make NHWC layout and MLIR attention defaults architecture-aware via new `gfx_*` helpers, and cache repeated device feature/name probes.
- Add caching for HIP compilation and MIOpen convolution solution lookups, plus scheduling/pointwise launch tuning and accompanying tests.
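The adaptive stream-count default in the first item could look roughly like the sketch below. The function name `resolve_stream_count` and the compute-unit thresholds are assumptions made for illustration only; the PR overview states that the default depends on compute unit count and that `0` means "adaptive", but it does not give the actual cutoffs.

```cpp
#include <cstddef>

// Hypothetical sketch: the thresholds below are illustrative assumptions,
// not the values used by the PR.
// `requested` mirrors the value of MIGRAPHX_NSTREAMS: 0 selects the adaptive
// default, any positive value is honored as an explicit override.
inline std::size_t resolve_stream_count(std::size_t requested, std::size_t compute_units)
{
    if(requested > 0)
        return requested; // explicit user override wins
    if(compute_units >= 96)
        return 4; // assumed: large parts can keep more streams busy
    if(compute_units >= 32)
        return 2; // assumed: mid-size parts
    return 1;     // assumed: small parts gain little from extra streams
}
```

With these assumed cutoffs, `resolve_stream_count(0, 96)` yields 4 while `resolve_stream_count(3, 96)` keeps the explicit value 3.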
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| test/schedule_test.cpp | Adds coverage for new scheduler split-threshold behavior. |
| test/gpu/jit.cpp | Adds a test ensuring pointwise launch bounds follow wavefront sizing. |
| test/gpu/device_name.cpp | Adds tests for new architecture-detection helpers and defaults. |
| test/gpu/context_serialize.cpp | Adds tests for stream-count heuristic and default stream count resolution. |
| test/gpu/compile_miopen_cache.cpp | New test validating MIOpen convolution solution caching. |
| test/gpu/compile_hip_cache.cpp | New test validating HIP compilation caching. |
| test/gpu/adjust_allocation.cpp | Adds test ensuring offload-copy return path avoids redundant host bounce. |
| src/targets/gpu/target.cpp | Enables NHWC by default on selected archs unless explicitly overridden by env. |
| src/targets/gpu/schedule_model.cpp | Introduces GPU split-threshold default to reduce undersized partitions. |
| src/targets/gpu/mlir.cpp | Fixes MLIR-disabled linkage by gating includes and adding stubbed APIs. |
| src/targets/gpu/lowering.cpp | Reuses host value for final return when result was only copied to GPU for pipeline reasons. |
| src/targets/gpu/jit/pointwise.cpp | Tunes untiled pointwise local size by wavefront size (capped). |
| src/targets/gpu/include/migraphx/gpu/schedule_model.hpp | Extends GPU schedule model API with split threshold. |
| src/targets/gpu/include/migraphx/gpu/device_name.hpp | Adds gfx_is_navi and arch-based default helper declarations. |
| src/targets/gpu/include/migraphx/gpu/convolution.hpp | Adds in-process caching for MIOpen convolution solution selection. |
| src/targets/gpu/include/migraphx/gpu/context.hpp | Adds adaptive stream-count resolver and changes default ctor behavior. |
| src/targets/gpu/fuse_mlir.cpp | Switches MLIR attention defaults to new arch helper logic. |
| src/targets/gpu/device_name.cpp | Adds cached device-info probing and implements new arch helper APIs. |
| src/targets/gpu/compile_hip.cpp | Adds caching of HIP compilation results to avoid repeated recompiles. |
| src/schedule.cpp | Plumbs new split-threshold through scheduler partitioning. |
| src/msgpack.cpp | Adds missing include needed for compilation. |
| src/include/migraphx/schedule_model.hpp | Extends type-erased scheduler model interface to include split threshold. |
| docs/reference/MIGraphX-dev-env-vars.rst | Documents new NHWC and adaptive stream-count default behavior. |
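The `src/targets/gpu/jit/pointwise.cpp` row describes sizing the untiled pointwise workgroup from the wavefront size, with a cap. A minimal sketch of that idea follows. The function name, the multiplier of 4 wavefronts per workgroup, and the cap of 256 are assumptions, chosen only so that a wave32 device yields the `local=128` reported in the PR description; the real constants may differ.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch: `waves_per_block` and `cap` are assumed values picked
// to reproduce local=128 on wave32; they are not taken from the PR's code.
inline std::size_t pointwise_local_size(std::size_t wavefront_size,
                                        std::size_t waves_per_block = 4,
                                        std::size_t cap             = 256)
{
    // Scale the workgroup with the hardware wavefront width, but never
    // exceed the cap, which prevents oversized launch configurations.
    return std::min(wavefront_size * waves_per_block, cap);
}
```

Under these assumptions a wave32 device gets a local size of 128 and a wave64 device gets 256, both well below a 1024-thread launch.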
anisha-amd (Contributor) reviewed on Mar 13, 2026 and left a comment:
Please add to the changelog as well.
Author
Addressed the inline review items in commit 1d810f1.
Summary
This PR improves several GPU-side MIGraphX defaults and hot paths that were leaving performance on the table, especially on recent RDNA-class parts.
Changes in this series:
- [...] `MIGRAPHX_ENABLE_MLIR=Off` [...]
- [...] `local=1024` launches

The pointwise launch change is the most visible runtime fix in the current OpenProteus path: on wave32 hardware it reduces PReLU-heavy pointwise kernels to `local=128`, which avoids the oversized launch configuration emitted previously.

Benchmark
Measured on `gfx1100` with:
- `2x_OpenProteus_Compact_i2_70K_fp32.onnx`
- `8x3x192x192`
- `fp16`
- `migraphx-driver perf --migraphx --gpu --enable-offload-copy`

Latest pointwise-launch patch vs previous series head:
- `11.80 ms` total, `11.90 ms` mean
- `8.31 ms` total, `8.39 ms` mean

That is roughly a `30%` reduction on the isolated MIGraphX perf path for this workload.

Validation
Locally validated with:
- `cmake --build ... --target migraphx_gpu driver test_gpu_jit`
- `test_gpu_jit compile_pointwise`
- `test_gpu_jit compile_pointwise_launch_bounds`
- `migraphx-driver compile` / `migraphx-driver perf` runs on `gfx1100`

Note: local validation was done in an MLIR-disabled build because this environment does not provide `rocMLIR`; upstream CI should cover the full build matrix.

Changelog Category