Benchmarks: Micro benchmark - add nvbench based kernel-launch & sleep-kernel #750
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##             main     #750      +/-   ##
==========================================
+ Coverage   85.69%   85.78%   +0.08%
==========================================
  Files         102      105       +3
  Lines        7699     7892     +193
==========================================
+ Hits         6598     6770     +172
- Misses       1101     1122      +21
```
Pull request overview

Adds NVBench-based CUDA GPU micro-benchmarks to SuperBench, including build integration, result parsing, tests, examples, and documentation updates.

Changes:
- Adds NVBench submodule integration and a cuda_nvbench third-party build target.
- Introduces two new micro-benchmarks (nvbench-sleep-kernel, nvbench-kernel-launch) with parsing + unit tests.
- Updates Docker images, docs, and CI workflow to support required tooling (notably newer CMake for NVBench).
Reviewed changes
Copilot reviewed 20 out of 23 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| third_party/nvbench | Adds NVBench as a git submodule dependency. |
| third_party/Makefile | Adds cuda_nvbench build/install target and adjusts recipe indentation. |
| tests/data/nvbench_sleep_kernel.log | Adds a sample NVBench sleep-kernel output fixture for parsing tests. |
| tests/data/nvbench_kernel_launch.log | Adds a sample NVBench kernel-launch output fixture for parsing tests. |
| tests/benchmarks/micro_benchmarks/test_nvbench_sleep_kernel.py | Adds unit tests for sleep-kernel preprocess and parsing. |
| tests/benchmarks/micro_benchmarks/test_nvbench_kernel_launch.py | Adds unit tests for kernel-launch preprocess and parsing. |
| superbench/benchmarks/micro_benchmarks/nvbench_sleep_kernel.py | Implements the NVBench sleep-kernel benchmark wrapper + output parser. |
| superbench/benchmarks/micro_benchmarks/nvbench_kernel_launch.py | Implements the NVBench kernel-launch benchmark wrapper + output parser. |
| superbench/benchmarks/micro_benchmarks/nvbench_base.py | Adds a shared NVBench benchmark base class (CLI args, parsing helpers). |
| superbench/benchmarks/micro_benchmarks/nvbench/sleep_kernel.cu | Adds NVBench CUDA benchmark implementing a sleep/busy-wait kernel. |
| superbench/benchmarks/micro_benchmarks/nvbench/kernel_launch.cu | Adds NVBench CUDA benchmark for empty-kernel launch overhead. |
| superbench/benchmarks/micro_benchmarks/nvbench/CMakeLists.txt | Adds CMake build for NVBench-based benchmark executables. |
| superbench/benchmarks/micro_benchmarks/__init__.py | Exports the new NVBench benchmarks from the micro-benchmarks package. |
| examples/benchmarks/nvbench_sleep_kernel.py | Adds an example runner for the sleep-kernel benchmark. |
| examples/benchmarks/nvbench_kernel_launch.py | Adds an example runner for the kernel-launch benchmark. |
| docs/user-tutorial/benchmarks/micro-benchmarks.md | Documents the new NVBench benchmarks and their metrics. |
| dockerfile/rocm5.0.x.dockerfile | Updates Intel MLC download version used in the ROCm image. |
| dockerfile/cuda13.0.dockerfile | Installs newer CMake and builds cuda_nvbench in the CUDA image. |
| dockerfile/cuda12.9.dockerfile | Installs newer CMake and builds cuda_nvbench in the CUDA image. |
| dockerfile/cuda12.8.dockerfile | Installs newer CMake and builds cuda_nvbench in the CUDA image. |
| .gitmodules | Registers the third_party/nvbench submodule. |
| .gitignore | Ignores compile_commands.json. |
| .github/workflows/codeql-analysis.yml | Upgrades CodeQL actions to v3 and adds CMake setup for the C++ job. |
```python
gpu_section = r'### \[(\d+)\] NVIDIA'
# Regex pattern to handle different time units and flexible spacing
row_pat = (
    r'\|\s*([0-9]+)\s*\|\s*'        # Duration (us)
    r'([0-9]+)x\s*\|\s*'            # Samples
    r'([\d.]+\s*[μmun]?s)\s*\|\s*'  # CPU Time (μs, ns, ms, us, s)
    r'([\d.]+%)\s*\|\s*'            # CPU Noise percentage
    r'([\d.]+\s*[μmun]?s)\s*\|\s*'  # GPU Time
    r'([\d.]+%)\s*\|\s*'            # GPU Noise percentage
    r'([0-9]+)x\s*\|\s*'            # Batch Samples
    r'([\d.]+\s*[μmun]?s)\s*\|'     # Batch GPU Time
)
```
Copilot AI, Jan 23, 2026
The parser expects each data row to start with a single `|`, but the provided fixture rows start with `||` (markdown-table style). With `re.match`, this prevents any row from matching and triggers the "No valid rows parsed" error. Update the regex to accept one or more leading pipes (anchor with `^\|+`) so both `| ...` and `|| ...` formats parse correctly.
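The suggested fix can be sketched in isolation; the sample rows and the trimmed two-column pattern below are hypothetical illustrations, not taken from the PR's fixture:

```python
import re

# Anchoring with ^\|+ accepts one-or-more leading pipes, so rows that
# start with either "|" or "||" both match.  This pattern is a trimmed,
# hypothetical version covering only the Samples and CPU Time columns.
row_pat = re.compile(r'^\|+\s*([0-9]+)x\s*\|\s*([\d.]+\s*[μmun]?s)\s*\|')

for line in ('| 1024x | 42.123 us |', '|| 1024x | 42.123 us |'):
    match = row_pat.match(line)
    print(line, '->', match.group(1) if match else 'no match')
```

With the original single-`|` anchor, only the first of the two sample lines would match; the `+` quantifier makes both formats parse.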
```python
gpu_section = r'### \[(\d+)\] NVIDIA'
# Regex pattern to handle different time units and flexible spacing
row_pat = (
    r'\|\s*([0-9]+)x\s*\|\s*'       # Samples
    r'([\d.]+\s*[μmun]?s)\s*\|\s*'  # CPU Time (μs, ns, ms, us, s)
    r'([\d.]+%)\s*\|\s*'            # CPU Noise percentage
    r'([\d.]+\s*[μmun]?s)\s*\|\s*'  # GPU Time
    r'([\d.]+%)\s*\|\s*'            # GPU Noise percentage
    r'([0-9]+)x\s*\|\s*'            # Batch Samples
    r'([\d.]+\s*[μmun]?s)\s*\|'     # Batch GPU Time
)
```
Copilot AI, Jan 23, 2026
Same issue as nvbench_sleep_kernel: the row regex only matches lines starting with a single `|`, but the fixture output uses `||`. This will make parsing fail. Allow one or more leading pipes (anchor with `^\|+`) so both formats are supported.
```python
def parse_time_to_us(raw: str) -> float:
    """Helper: parse '123.45 us', '678.9 ns', '0.12 ms' → float µs."""
    raw = raw.strip()
    if raw.endswith('%'):
        return float(raw[:-1])
    # split "value unit" or "valueunit"
    m = re.match(r'([\d.]+)\s*([mun]?s)?', raw)
    if not m:
        return float(raw)
    val, unit = float(m.group(1)), (m.group(2) or 'us')
    if unit == 'ns':
        return val / 1e3
    if unit == 'ms':
        return val * 1e3
    return val
```
Copilot AI, Jan 23, 2026
parse_time_to_us currently does not convert seconds (s) to microseconds (it falls through and returns val). Since your row regex explicitly allows plain s, this yields incorrect results by a factor of 1e6 when NVBench reports seconds. Add explicit handling for unit == 's' (multiply by 1e6), and consider anchoring the regex to the end of the string to avoid partial matches.
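A sketch of how the helper could look with both suggestions applied (explicit seconds handling and an end-anchored regex); the scale-table approach is one possible implementation, not the PR's actual code:

```python
import re

def parse_time_to_us(raw: str) -> float:
    """Parse '123.45 us', '678.9 ns', '0.12 ms', or '1.5 s' into microseconds."""
    raw = raw.strip()
    if raw.endswith('%'):
        return float(raw[:-1])
    # The trailing $ anchors the match so stray text cannot partially match.
    m = re.match(r'([\d.]+)\s*([μmun]?s)?$', raw)
    if not m:
        return float(raw)
    val, unit = float(m.group(1)), (m.group(2) or 'us')
    # Explicit scale per unit, including plain seconds (the missing case).
    scale = {'ns': 1e-3, 'us': 1.0, 'μs': 1.0, 'ms': 1e3, 's': 1e6}
    return val * scale[unit]
```

With this version, `parse_time_to_us('1.5 s')` yields 1500000.0 instead of 1.5, closing the factor-of-1e6 gap the review points out.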
```cpp
    .add_int64_axis("Duration (us)", nvbench::range(0, 100, 5))
    .set_timeout(1); // Limit to one second per measurement.
```
Copilot AI, Jan 23, 2026
This hard-codes a 1s timeout at the benchmark definition level, which can override/conflict with the CLI --timeout that SuperBench passes through (and tests/configs expect to control). To make --timeout effective and consistent across NVBench benchmarks, remove the .set_timeout(1) override (or only apply it when no explicit timeout is provided).
Suggested change:

```diff
-    .add_int64_axis("Duration (us)", nvbench::range(0, 100, 5))
-    .set_timeout(1); // Limit to one second per measurement.
+    .add_int64_axis("Duration (us)", nvbench::range(0, 100, 5));
```
```python
assert benchmark.result['duration_us_25_cpu_time'][0] == 42.123
# assert benchmark.result['duration_us_25_cpu_noise'][0] == 69.78
assert benchmark.result['duration_us_25_gpu_time'][0] == 25.321
# assert benchmark.result['duration_us_25_gpu_noise'][0] == 0.93
# assert benchmark.result['duration_us_25_batch_samples'][0] == 17448
assert benchmark.result['duration_us_25_batch_gpu_time'][0] == 23.456
```
Copilot AI, Jan 23, 2026
These tests compare floats for exact equality, which is fragile due to floating-point representation (e.g., parsing may produce 42.123000000000005). Use unittest.TestCase float assertions (e.g., assertAlmostEqual) with a tolerance to avoid flaky failures.
Suggested change:

```diff
-assert benchmark.result['duration_us_25_cpu_time'][0] == 42.123
+self.assertAlmostEqual(benchmark.result['duration_us_25_cpu_time'][0], 42.123, places=6)
 # assert benchmark.result['duration_us_25_cpu_noise'][0] == 69.78
-assert benchmark.result['duration_us_25_gpu_time'][0] == 25.321
+self.assertAlmostEqual(benchmark.result['duration_us_25_gpu_time'][0], 25.321, places=6)
 # assert benchmark.result['duration_us_25_gpu_noise'][0] == 0.93
 # assert benchmark.result['duration_us_25_batch_samples'][0] == 17448
-assert benchmark.result['duration_us_25_batch_gpu_time'][0] == 23.456
+self.assertAlmostEqual(benchmark.result['duration_us_25_batch_gpu_time'][0], 23.456, places=6)
```
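The fragility being flagged is easy to demonstrate with classic binary-float values (hypothetical numbers, not the fixture's):

```python
# 0.1 and 0.2 have no exact binary representation, so their sum differs
# from the literal 0.3 by a tiny error that breaks exact comparison.
parsed = 0.1 + 0.2
print(parsed == 0.3)                 # False
print(abs(parsed - 0.3) < 1e-6)      # True
# assertAlmostEqual(x, y, places=6) effectively checks round(x - y, 6) == 0.
print(round(parsed - 0.3, 6) == 0)   # True
```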
```python
    'BlasLtBaseBenchmark', 'ComputationCommunicationOverlap', 'CpuMemBwLatencyBenchmark', 'CpuHplBenchmark',
    'CpuStreamBenchmark', 'CublasBenchmark', 'CublasLtBenchmark', 'CudaGemmFlopsBenchmark', 'CudaMemBwBenchmark',
    'CudaNcclBwBenchmark', 'CudnnBenchmark', 'DiskBenchmark', 'DistInference', 'HipBlasLtBenchmark', 'GPCNetBenchmark',
    'GemmFlopsBenchmark', 'GpuBurnBenchmark', 'GpuCopyBwBenchmark', 'GpuStreamBenchmark', 'IBBenchmark',
    'IBLoopbackBenchmark', 'KernelLaunch', 'MemBwBenchmark', 'MicroBenchmark', 'MicroBenchmarkWithInvoke',
    'ORTInferenceBenchmark', 'RocmGemmFlopsBenchmark', 'RocmMemBwBenchmark', 'ShardingMatmul',
    'TCPConnectivityBenchmark', 'TensorRTInferenceBenchmark', 'DirectXGPUEncodingLatency', 'DirectXGPUCopyBw',
    'DirectXGPUMemBw', 'DirectXGPUCoreFlops', 'NvBandwidthBenchmark', 'NvbenchKernelLaunch', 'NvbenchSleepKernel'
```
Copilot AI, Jan 23, 2026
Collapsing __all__ to long comma-separated lines reduces readability and likely violates typical line-length formatting used elsewhere in the project. Consider reverting to one-entry-per-line (or a more structured wrap) to keep diffs smaller and maintenance easier.
Suggested change:

```diff
-    'BlasLtBaseBenchmark', 'ComputationCommunicationOverlap', 'CpuMemBwLatencyBenchmark', 'CpuHplBenchmark',
-    'CpuStreamBenchmark', 'CublasBenchmark', 'CublasLtBenchmark', 'CudaGemmFlopsBenchmark', 'CudaMemBwBenchmark',
-    'CudaNcclBwBenchmark', 'CudnnBenchmark', 'DiskBenchmark', 'DistInference', 'HipBlasLtBenchmark', 'GPCNetBenchmark',
-    'GemmFlopsBenchmark', 'GpuBurnBenchmark', 'GpuCopyBwBenchmark', 'GpuStreamBenchmark', 'IBBenchmark',
-    'IBLoopbackBenchmark', 'KernelLaunch', 'MemBwBenchmark', 'MicroBenchmark', 'MicroBenchmarkWithInvoke',
-    'ORTInferenceBenchmark', 'RocmGemmFlopsBenchmark', 'RocmMemBwBenchmark', 'ShardingMatmul',
-    'TCPConnectivityBenchmark', 'TensorRTInferenceBenchmark', 'DirectXGPUEncodingLatency', 'DirectXGPUCopyBw',
-    'DirectXGPUMemBw', 'DirectXGPUCoreFlops', 'NvBandwidthBenchmark', 'NvbenchKernelLaunch', 'NvbenchSleepKernel'
+    'BlasLtBaseBenchmark',
+    'ComputationCommunicationOverlap',
+    'CpuMemBwLatencyBenchmark',
+    'CpuHplBenchmark',
+    'CpuStreamBenchmark',
+    'CublasBenchmark',
+    'CublasLtBenchmark',
+    'CudaGemmFlopsBenchmark',
+    'CudaMemBwBenchmark',
+    'CudaNcclBwBenchmark',
+    'CudnnBenchmark',
+    'DiskBenchmark',
+    'DistInference',
+    'HipBlasLtBenchmark',
+    'GPCNetBenchmark',
+    'GemmFlopsBenchmark',
+    'GpuBurnBenchmark',
+    'GpuCopyBwBenchmark',
+    'GpuStreamBenchmark',
+    'IBBenchmark',
+    'IBLoopbackBenchmark',
+    'KernelLaunch',
+    'MemBwBenchmark',
+    'MicroBenchmark',
+    'MicroBenchmarkWithInvoke',
+    'ORTInferenceBenchmark',
+    'RocmGemmFlopsBenchmark',
+    'RocmMemBwBenchmark',
+    'ShardingMatmul',
+    'TCPConnectivityBenchmark',
+    'TensorRTInferenceBenchmark',
+    'DirectXGPUEncodingLatency',
+    'DirectXGPUCopyBw',
+    'DirectXGPUMemBw',
+    'DirectXGPUCoreFlops',
+    'NvBandwidthBenchmark',
+    'NvbenchKernelLaunch',
+    'NvbenchSleepKernel',
```
```diff
-          DEBIAN_FRONTEND=noninteractive apt-get install -y ffmpeg libavcodec-dev libavformat-dev libavutil-dev libswresample-dev sudo
+          DEBIAN_FRONTEND=noninteractive apt-get install -y ffmpeg libavcodec-dev libavformat-dev libavutil-dev libswresample-dev sudo build-essential
+      - name: Setup CMake
+        uses: lukka/get-cmake@latest
```
Copilot AI, Jan 23, 2026
Using @latest for third-party GitHub Actions is a supply-chain risk and can lead to non-reproducible CI behavior. Pin this action to a specific tagged version or commit SHA.
Suggested change:

```diff
-        uses: lukka/get-cmake@latest
+        uses: lukka/get-cmake@v3.20.0
```
```makefile
	&& git -C msccl checkout 87048bd && git -C msccl submodule update --recursive --init
else ifeq ($(shell echo $(CUDA_VER)">=12.8" | bc -l), 1)
	# Get commit 87048bd from msscl to support updated nccl and sm_100
	# Get commit 87048bd from msscl to support updated nccl and sm_100
```
Copilot AI, Jan 23, 2026
Typo in comment: change msscl to msccl.
Suggested change:

```diff
-	# Get commit 87048bd from msscl to support updated nccl and sm_100
+	# Get commit 87048bd from msccl to support updated nccl and sm_100
```
This pull request adds support for NVBench-based GPU micro-benchmarks to SuperBench:

- nvbench-sleep-kernel
- nvbench-kernel-launch

Example config: