
Improve GPU defaults, caching, and pointwise scheduling #4668

Open
Rolaand-Jayz wants to merge 13 commits into ROCm:develop from Rolaand-Jayz:opt/migraphx-performance-pr

Rolaand-Jayz commented Mar 13, 2026

Summary

This PR improves several GPU-side MIGraphX defaults and hot paths that were leaving performance on the table, especially on recent RDNA-class parts.

Changes in this series:

  • auto-tune default GPU stream count by device CU count
  • select NHWC and MLIR attention defaults by architecture instead of relying on manual env tuning
  • skip a redundant host bounce on final GPU fallback returns
  • fix GPU linkage when MIGRAPHX_ENABLE_MLIR=Off
  • cache repeated GPU feature/device-name probes
  • cache repeated HIP code compilations
  • cache repeated MIOpen convolution solution lookups
  • avoid scheduling undersized async stream partitions
  • tune untiled GPU pointwise launch bounds by wavefront size instead of falling back to oversized local=1024 launches
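As a rough illustration of the first item, the sketch below resolves a stream count from a requested value and the device's compute-unit count, treating a request of 0 as "adaptive" in the spirit of MIGRAPHX_NSTREAMS=0. The function names and the CU thresholds are hypothetical and chosen only for illustration; the cut-offs actually used in this PR are not stated here.

```cpp
#include <cstddef>

// Hypothetical thresholds for illustration only; the PR's real cut-offs
// are not given in this conversation.
constexpr std::size_t heuristic_stream_count(std::size_t cu_count)
{
    if(cu_count < 32)
        return 1; // small parts: extra streams mostly add overhead
    if(cu_count < 64)
        return 2;
    return 4;
}

// requested == 0 means "adaptive"; any positive value is honoured as-is.
constexpr std::size_t resolve_stream_count(std::size_t requested, std::size_t cu_count)
{
    return requested > 0 ? requested : heuristic_stream_count(cu_count);
}
```

An explicit positive request always wins, so existing manual tuning keeps working; only the default path consults the CU count.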

The pointwise launch change is the most visible runtime fix in the current OpenProteus path: on wave32 hardware it reduces PReLU-heavy pointwise kernels to local=128, which avoids the oversized launch configuration emitted previously.
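A minimal sketch of wavefront-based local sizing follows. It assumes the local size is a fixed number of wavefronts per workgroup with an upper cap; the multiplier of 4 and the cap of 256 are assumptions picked so that wave32 yields the local=128 quoted above, and `pointwise_local_size` is a hypothetical name, not the function in src/targets/gpu/jit/pointwise.cpp.

```cpp
#include <algorithm>
#include <cstddef>

// Assumed for illustration: four wavefronts per workgroup, capped at 256.
// Only the wave32 -> local=128 result is stated in the PR description.
constexpr std::size_t pointwise_local_size(std::size_t wavefront_size,
                                           std::size_t waves_per_group = 4,
                                           std::size_t cap             = 256)
{
    return std::min(wavefront_size * waves_per_group, cap);
}
```

Under these assumptions a wave32 device launches with local=128 and a wave64 device with local=256, both well below the previous local=1024.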

Benchmark

Measured on gfx1100 with:

  • model: 2x_OpenProteus_Compact_i2_70K_fp32.onnx
  • shape: 8x3x192x192
  • precision: fp16
  • command: migraphx-driver perf --migraphx --gpu --enable-offload-copy

Latest pointwise-launch patch vs previous series head:

  • previous: 11.80 ms total, 11.90 ms mean
  • this branch: 8.31 ms total, 8.39 ms mean

That is roughly a 30% reduction in run time on the isolated MIGraphX perf path for this workload.
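For reference, the relative reduction follows directly from the figures above, for total and mean time respectively:

```latex
\[
\frac{11.80 - 8.31}{11.80} \approx 0.296,
\qquad
\frac{11.90 - 8.39}{11.90} \approx 0.295
\]
```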

Validation

Locally validated with:

  • cmake --build ... --target migraphx_gpu driver test_gpu_jit
  • test_gpu_jit compile_pointwise
  • test_gpu_jit compile_pointwise_launch_bounds
  • repeated migraphx-driver compile / migraphx-driver perf runs on gfx1100

Note: local validation was done in an MLIR-disabled build because this environment does not provide rocMLIR; upstream CI should cover the full build matrix.

Changelog Category

  • Added: New functionality.
  • Changed: Changes to existing functionality.
  • Removed: Functionality or support that has been removed. (Compared to a previous release)
  • Optimized: Component performance that has been optimized or improved.
  • Resolved Issues: Known issues from a previous version that have been resolved.
  • Not Applicable: This PR is not to be included in the changelog.

@Rolaand-Jayz Rolaand-Jayz requested review from a team and causten as code owners March 13, 2026 20:01
Copilot AI review requested due to automatic review settings March 13, 2026 20:01
Copilot AI left a comment

Pull request overview

This PR updates MIGraphX's GPU target defaults and hot paths to improve performance on modern AMD GPUs (notably RDNA), adding architecture-aware heuristics, caching for repeated GPU compilation/solver work, and tuning scheduler/pointwise launch behavior.

Changes:

  • Add adaptive GPU stream-count defaults (based on compute unit count) and document the new MIGRAPHX_NSTREAMS=0 “adaptive” behavior.
  • Make NHWC layout and MLIR attention defaults architecture-aware via new gfx_* helpers, and cache repeated device feature/name probes.
  • Add caching for HIP compilation and MIOpen convolution solution lookups, plus scheduling/pointwise launch tuning and accompanying tests.
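The compilation caching described in the last item can be pictured as a keyed memo table. The sketch below is a generic in-process cache under stated assumptions: `compile_cache`, its key layout, and the string-based "binary" are all hypothetical stand-ins, not the structures used in src/targets/gpu/compile_hip.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical in-process cache; names are illustrative, not MIGraphX APIs.
class compile_cache
{
    public:
    using compiler = std::function<std::string(const std::string&, const std::string&)>;

    explicit compile_cache(compiler c) : compile_fn(std::move(c))
    {
        assert(static_cast<bool>(compile_fn)); // a callable compiler is required
    }

    // Returns the cached result by value, compiling only on the first request.
    // The lock is held during compilation, which serialises concurrent misses.
    std::string get(const std::string& source, const std::string& flags)
    {
        // Length-prefix the flags so distinct (flags, source) pairs cannot collide.
        const std::string key = std::to_string(flags.size()) + ":" + flags + source;
        std::lock_guard<std::mutex> lock(m);
        auto it = entries.find(key);
        if(it == entries.end())
            it = entries.emplace(key, compile_fn(source, flags)).first;
        return it->second;
    }

    private:
    compiler compile_fn;
    std::mutex m;
    std::unordered_map<std::string, std::string> entries;
};
```

The key includes the flags as well as the source, since the same kernel text compiled with different options must not share an entry.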

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| test/schedule_test.cpp | Adds coverage for new scheduler split-threshold behavior. |
| test/gpu/jit.cpp | Adds a test ensuring pointwise launch bounds follow wavefront sizing. |
| test/gpu/device_name.cpp | Adds tests for new architecture-detection helpers and defaults. |
| test/gpu/context_serialize.cpp | Adds tests for stream-count heuristic and default stream count resolution. |
| test/gpu/compile_miopen_cache.cpp | New test validating MIOpen convolution solution caching. |
| test/gpu/compile_hip_cache.cpp | New test validating HIP compilation caching. |
| test/gpu/adjust_allocation.cpp | Adds test ensuring offload-copy return path avoids redundant host bounce. |
| src/targets/gpu/target.cpp | Enables NHWC by default on selected archs unless explicitly overridden by env. |
| src/targets/gpu/schedule_model.cpp | Introduces GPU split-threshold default to reduce undersized partitions. |
| src/targets/gpu/mlir.cpp | Fixes MLIR-disabled linkage by gating includes and adding stubbed APIs. |
| src/targets/gpu/lowering.cpp | Reuses host value for final return when result was only copied to GPU for pipeline reasons. |
| src/targets/gpu/jit/pointwise.cpp | Tunes untiled pointwise local size by wavefront size (capped). |
| src/targets/gpu/include/migraphx/gpu/schedule_model.hpp | Extends GPU schedule model API with split threshold. |
| src/targets/gpu/include/migraphx/gpu/device_name.hpp | Adds gfx_is_navi and arch-based default helper declarations. |
| src/targets/gpu/include/migraphx/gpu/convolution.hpp | Adds in-process caching for MIOpen convolution solution selection. |
| src/targets/gpu/include/migraphx/gpu/context.hpp | Adds adaptive stream-count resolver and changes default ctor behavior. |
| src/targets/gpu/fuse_mlir.cpp | Switches MLIR attention defaults to new arch helper logic. |
| src/targets/gpu/device_name.cpp | Adds cached device-info probing and implements new arch helper APIs. |
| src/targets/gpu/compile_hip.cpp | Adds caching of HIP compilation results to avoid repeated recompiles. |
| src/schedule.cpp | Plumbs new split-threshold through scheduler partitioning. |
| src/msgpack.cpp | Adds missing include needed for compilation. |
| src/include/migraphx/schedule_model.hpp | Extends type-erased scheduler model interface to include split threshold. |
| docs/reference/MIGraphX-dev-env-vars.rst | Documents new NHWC and adaptive stream-count default behavior. |

anisha-amd (Contributor) left a comment

Please add to the changelog as well.

@Rolaand-Jayz (Author)

Addressed the inline review items in commit 1d810f1.

  • Guarded scoped_env_var behind #ifndef _WIN32 in both cache tests so MSVC does not compile setenv/unsetenv.
  • Changed get_cached_device_info() to return by value to avoid returning a reference into the cached unordered_map after unlocking.
  • Added CHANGELOG.md entries under Develop / Changed and Develop / Optimized for PR #4668, covering the architecture-aware GPU defaults and the caching/scheduling/pointwise optimizations.
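To illustrate the second item, here is a minimal sketch of a mutex-guarded cache whose accessor returns by value. `device_info`, `device_info_cache`, and the placeholder `probe` are hypothetical stand-ins for illustration; the actual get_cached_device_info() signature and record type are not shown in this conversation.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the cached per-device record.
struct device_info
{
    std::string name;
    int cu_count = 0;
};

class device_info_cache
{
    public:
    // Returns a copy made while the lock is held. Returning a reference would
    // let the caller read the map entry after the lock is released, racing
    // with any later writer that updates or erases that entry.
    device_info get(int device_id)
    {
        std::lock_guard<std::mutex> lock(m);
        auto it = entries.find(device_id);
        if(it == entries.end())
            it = entries.emplace(device_id, probe(device_id)).first;
        return it->second;
    }

    private:
    // Placeholder probe; a real one would query the HIP runtime.
    static device_info probe(int device_id)
    {
        assert(device_id >= 0);
        return {"device" + std::to_string(device_id), 0};
    }

    std::mutex m;
    std::unordered_map<int, device_info> entries;
};
```

The copy costs a small allocation per call, which is usually an acceptable price for a record this size; callers that need the value repeatedly can keep their own copy.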

@causten causten requested a review from pfultz2 March 16, 2026 15:43