Benchmarks: Micro benchmark - add nvbench based kernel-launch & sleep-kernel #750
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##             main     #750      +/-   ##
==========================================
+ Coverage   85.69%   85.78%   +0.08%
==========================================
  Files         102      105       +3
  Lines        7699     7892     +193
==========================================
+ Hits         6598     6770     +172
- Misses       1101     1122      +21
```
Pull request overview

Adds NVBench-based CUDA GPU micro-benchmarks to SuperBench, including build integration, result parsing, tests, examples, and documentation updates.

Changes:
- Adds NVBench submodule integration and a cuda_nvbench third-party build target.
- Introduces two new micro-benchmarks (nvbench-sleep-kernel, nvbench-kernel-launch) with parsing + unit tests.
- Updates Docker images, docs, and CI workflow to support required tooling (notably newer CMake for NVBench).
Reviewed changes
Copilot reviewed 20 out of 23 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| third_party/nvbench | Adds NVBench as a git submodule dependency. |
| third_party/Makefile | Adds cuda_nvbench build/install target and adjusts recipe indentation. |
| tests/data/nvbench_sleep_kernel.log | Adds a sample NVBench sleep-kernel output fixture for parsing tests. |
| tests/data/nvbench_kernel_launch.log | Adds a sample NVBench kernel-launch output fixture for parsing tests. |
| tests/benchmarks/micro_benchmarks/test_nvbench_sleep_kernel.py | Adds unit tests for sleep-kernel preprocess and parsing. |
| tests/benchmarks/micro_benchmarks/test_nvbench_kernel_launch.py | Adds unit tests for kernel-launch preprocess and parsing. |
| superbench/benchmarks/micro_benchmarks/nvbench_sleep_kernel.py | Implements the NVBench sleep-kernel benchmark wrapper + output parser. |
| superbench/benchmarks/micro_benchmarks/nvbench_kernel_launch.py | Implements the NVBench kernel-launch benchmark wrapper + output parser. |
| superbench/benchmarks/micro_benchmarks/nvbench_base.py | Adds a shared NVBench benchmark base class (CLI args, parsing helpers). |
| superbench/benchmarks/micro_benchmarks/nvbench/sleep_kernel.cu | Adds NVBench CUDA benchmark implementing a sleep/busy-wait kernel. |
| superbench/benchmarks/micro_benchmarks/nvbench/kernel_launch.cu | Adds NVBench CUDA benchmark for empty-kernel launch overhead. |
| superbench/benchmarks/micro_benchmarks/nvbench/CMakeLists.txt | Adds CMake build for NVBench-based benchmark executables. |
| superbench/benchmarks/micro_benchmarks/__init__.py | Exports the new NVBench benchmarks from the micro-benchmarks package. |
| examples/benchmarks/nvbench_sleep_kernel.py | Adds an example runner for the sleep-kernel benchmark. |
| examples/benchmarks/nvbench_kernel_launch.py | Adds an example runner for the kernel-launch benchmark. |
| docs/user-tutorial/benchmarks/micro-benchmarks.md | Documents the new NVBench benchmarks and their metrics. |
| dockerfile/rocm5.0.x.dockerfile | Updates Intel MLC download version used in the ROCm image. |
| dockerfile/cuda13.0.dockerfile | Installs newer CMake and builds cuda_nvbench in the CUDA image. |
| dockerfile/cuda12.9.dockerfile | Installs newer CMake and builds cuda_nvbench in the CUDA image. |
| dockerfile/cuda12.8.dockerfile | Installs newer CMake and builds cuda_nvbench in the CUDA image. |
| .gitmodules | Registers the third_party/nvbench submodule. |
| .gitignore | Ignores compile_commands.json. |
| .github/workflows/codeql-analysis.yml | Upgrades CodeQL actions to v3 and adds CMake setup for the C++ job. |
```python
gpu_section = r'### \[(\d+)\] NVIDIA'
# Regex pattern to handle different time units and flexible spacing
row_pat = (
    r'\|\s*([0-9]+)\s*\|\s*'        # Duration (us)
    r'([0-9]+)x\s*\|\s*'            # Samples
    r'([\d.]+\s*[μmun]?s)\s*\|\s*'  # CPU Time (μs, ns, ms, us, s)
    r'([\d.]+%)\s*\|\s*'            # CPU Noise percentage
    r'([\d.]+\s*[μmun]?s)\s*\|\s*'  # GPU Time
    r'([\d.]+%)\s*\|\s*'            # GPU Noise percentage
    r'([0-9]+)x\s*\|\s*'            # Batch Samples
    r'([\d.]+\s*[μmun]?s)\s*\|'     # Batch GPU Time
)
```
Copilot AI, Jan 23, 2026
The parser expects each data row to start with a single `|`, but the provided fixture rows start with `||` (markdown-table style). With `re.match`, this prevents any row from matching and triggers the "No valid rows parsed" error. Update the regex to accept one or more leading pipes (anchor with `^\|+`) so both `| ...` and `|| ...` formats parse correctly.
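The suggested fix can be sketched in isolation; the sample rows and the trimmed two-column pattern below are hypothetical illustrations, not taken from the PR's fixture:

```python
import re

# Anchoring with ^\|+ accepts one-or-more leading pipes, so rows that
# start with either "|" or "||" both match.  This pattern is a trimmed,
# hypothetical version covering only the Samples and CPU Time columns.
row_pat = re.compile(r'^\|+\s*([0-9]+)x\s*\|\s*([\d.]+\s*[μmun]?s)\s*\|')

for line in ('| 1024x | 42.123 us |', '|| 1024x | 42.123 us |'):
    match = row_pat.match(line)
    print(line, '->', match.group(1) if match else 'no match')
```

With the original single-`|` anchor, only the first of the two sample lines would match; the `+` quantifier makes both formats parse.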
```python
gpu_section = r'### \[(\d+)\] NVIDIA'
# Regex pattern to handle different time units and flexible spacing
row_pat = (
    r'\|\s*([0-9]+)x\s*\|\s*'       # Samples
    r'([\d.]+\s*[μmun]?s)\s*\|\s*'  # CPU Time (μs, ns, ms, us, s)
    r'([\d.]+%)\s*\|\s*'            # CPU Noise percentage
    r'([\d.]+\s*[μmun]?s)\s*\|\s*'  # GPU Time
    r'([\d.]+%)\s*\|\s*'            # GPU Noise percentage
    r'([0-9]+)x\s*\|\s*'            # Batch Samples
    r'([\d.]+\s*[μmun]?s)\s*\|'     # Batch GPU Time
)
```
Copilot AI, Jan 23, 2026
Same issue as nvbench_sleep_kernel: the row regex only matches lines starting with a single `|`, but the fixture output uses `||`. This will make parsing fail. Allow one or more leading pipes (anchor with `^\|+`) so both formats are supported.
```python
def parse_time_to_us(raw: str) -> float:
    """Helper: parse '123.45 us', '678.9 ns', '0.12 ms' → float µs."""
    raw = raw.strip()
    if raw.endswith('%'):
        return float(raw[:-1])
    # split "value unit" or "valueunit"
    m = re.match(r'([\d.]+)\s*([mun]?s)?', raw)
    if not m:
        return float(raw)
    val, unit = float(m.group(1)), (m.group(2) or 'us')
    if unit == 'ns':
        return val / 1e3
    if unit == 'ms':
        return val * 1e3
    return val
```
Copilot AI, Jan 23, 2026
parse_time_to_us currently does not convert seconds (s) to microseconds (it falls through and returns val). Since your row regex explicitly allows plain s, this yields incorrect results by a factor of 1e6 when NVBench reports seconds. Add explicit handling for unit == 's' (multiply by 1e6), and consider anchoring the regex to the end of the string to avoid partial matches.
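A sketch of how the helper could look with both suggestions applied (explicit seconds handling and an end-anchored regex); the scale-table approach is one possible implementation, not the PR's actual code:

```python
import re

def parse_time_to_us(raw: str) -> float:
    """Parse '123.45 us', '678.9 ns', '0.12 ms', or '1.5 s' into microseconds."""
    raw = raw.strip()
    if raw.endswith('%'):
        return float(raw[:-1])
    # The trailing $ anchors the match so stray text cannot partially match.
    m = re.match(r'([\d.]+)\s*([μmun]?s)?$', raw)
    if not m:
        return float(raw)
    val, unit = float(m.group(1)), (m.group(2) or 'us')
    # Explicit scale per unit, including plain seconds (the missing case).
    scale = {'ns': 1e-3, 'us': 1.0, 'μs': 1.0, 'ms': 1e3, 's': 1e6}
    return val * scale[unit]
```

With this version, `parse_time_to_us('1.5 s')` yields 1500000.0 instead of 1.5, closing the factor-of-1e6 gap the review points out.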
```cpp
    .add_int64_axis("Duration (us)", nvbench::range(0, 100, 5))
    .set_timeout(1); // Limit to one second per measurement.
```
Copilot AI, Jan 23, 2026
This hard-codes a 1s timeout at the benchmark definition level, which can override/conflict with the CLI --timeout that SuperBench passes through (and tests/configs expect to control). To make --timeout effective and consistent across NVBench benchmarks, remove the .set_timeout(1) override (or only apply it when no explicit timeout is provided).
Suggested change:

```diff
-    .add_int64_axis("Duration (us)", nvbench::range(0, 100, 5))
-    .set_timeout(1); // Limit to one second per measurement.
+    .add_int64_axis("Duration (us)", nvbench::range(0, 100, 5));
```
```python
assert benchmark.result['duration_us_25_cpu_time'][0] == 42.123
# assert benchmark.result['duration_us_25_cpu_noise'][0] == 69.78
assert benchmark.result['duration_us_25_gpu_time'][0] == 25.321
# assert benchmark.result['duration_us_25_gpu_noise'][0] == 0.93
# assert benchmark.result['duration_us_25_batch_samples'][0] == 17448
assert benchmark.result['duration_us_25_batch_gpu_time'][0] == 23.456
```
Copilot AI, Jan 23, 2026
These tests compare floats for exact equality, which is fragile due to floating-point representation (e.g., parsing may produce 42.123000000000005). Use unittest.TestCase float assertions (e.g., assertAlmostEqual) with a tolerance to avoid flaky failures.
Suggested change:

```diff
-assert benchmark.result['duration_us_25_cpu_time'][0] == 42.123
+self.assertAlmostEqual(benchmark.result['duration_us_25_cpu_time'][0], 42.123, places=6)
 # assert benchmark.result['duration_us_25_cpu_noise'][0] == 69.78
-assert benchmark.result['duration_us_25_gpu_time'][0] == 25.321
+self.assertAlmostEqual(benchmark.result['duration_us_25_gpu_time'][0], 25.321, places=6)
 # assert benchmark.result['duration_us_25_gpu_noise'][0] == 0.93
 # assert benchmark.result['duration_us_25_batch_samples'][0] == 17448
-assert benchmark.result['duration_us_25_batch_gpu_time'][0] == 23.456
+self.assertAlmostEqual(benchmark.result['duration_us_25_batch_gpu_time'][0], 23.456, places=6)
```
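The fragility being flagged is easy to demonstrate with classic binary-float values (hypothetical numbers, not the fixture's):

```python
# 0.1 and 0.2 have no exact binary representation, so their sum differs
# from the literal 0.3 by a tiny error that breaks exact comparison.
parsed = 0.1 + 0.2
print(parsed == 0.3)                 # False
print(abs(parsed - 0.3) < 1e-6)      # True
# assertAlmostEqual(x, y, places=6) effectively checks round(x - y, 6) == 0.
print(round(parsed - 0.3, 6) == 0)   # True
```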
```python
    'BlasLtBaseBenchmark', 'ComputationCommunicationOverlap', 'CpuMemBwLatencyBenchmark', 'CpuHplBenchmark',
    'CpuStreamBenchmark', 'CublasBenchmark', 'CublasLtBenchmark', 'CudaGemmFlopsBenchmark', 'CudaMemBwBenchmark',
    'CudaNcclBwBenchmark', 'CudnnBenchmark', 'DiskBenchmark', 'DistInference', 'HipBlasLtBenchmark', 'GPCNetBenchmark',
    'GemmFlopsBenchmark', 'GpuBurnBenchmark', 'GpuCopyBwBenchmark', 'GpuStreamBenchmark', 'IBBenchmark',
    'IBLoopbackBenchmark', 'KernelLaunch', 'MemBwBenchmark', 'MicroBenchmark', 'MicroBenchmarkWithInvoke',
    'ORTInferenceBenchmark', 'RocmGemmFlopsBenchmark', 'RocmMemBwBenchmark', 'ShardingMatmul',
    'TCPConnectivityBenchmark', 'TensorRTInferenceBenchmark', 'DirectXGPUEncodingLatency', 'DirectXGPUCopyBw',
    'DirectXGPUMemBw', 'DirectXGPUCoreFlops', 'NvBandwidthBenchmark', 'NvbenchKernelLaunch', 'NvbenchSleepKernel'
```
Copilot AI, Jan 23, 2026
Collapsing __all__ to long comma-separated lines reduces readability and likely violates typical line-length formatting used elsewhere in the project. Consider reverting to one-entry-per-line (or a more structured wrap) to keep diffs smaller and maintenance easier.
Suggested change:

```diff
-    'BlasLtBaseBenchmark', 'ComputationCommunicationOverlap', 'CpuMemBwLatencyBenchmark', 'CpuHplBenchmark',
-    'CpuStreamBenchmark', 'CublasBenchmark', 'CublasLtBenchmark', 'CudaGemmFlopsBenchmark', 'CudaMemBwBenchmark',
-    'CudaNcclBwBenchmark', 'CudnnBenchmark', 'DiskBenchmark', 'DistInference', 'HipBlasLtBenchmark', 'GPCNetBenchmark',
-    'GemmFlopsBenchmark', 'GpuBurnBenchmark', 'GpuCopyBwBenchmark', 'GpuStreamBenchmark', 'IBBenchmark',
-    'IBLoopbackBenchmark', 'KernelLaunch', 'MemBwBenchmark', 'MicroBenchmark', 'MicroBenchmarkWithInvoke',
-    'ORTInferenceBenchmark', 'RocmGemmFlopsBenchmark', 'RocmMemBwBenchmark', 'ShardingMatmul',
-    'TCPConnectivityBenchmark', 'TensorRTInferenceBenchmark', 'DirectXGPUEncodingLatency', 'DirectXGPUCopyBw',
-    'DirectXGPUMemBw', 'DirectXGPUCoreFlops', 'NvBandwidthBenchmark', 'NvbenchKernelLaunch', 'NvbenchSleepKernel'
+    'BlasLtBaseBenchmark',
+    'ComputationCommunicationOverlap',
+    'CpuMemBwLatencyBenchmark',
+    'CpuHplBenchmark',
+    'CpuStreamBenchmark',
+    'CublasBenchmark',
+    'CublasLtBenchmark',
+    'CudaGemmFlopsBenchmark',
+    'CudaMemBwBenchmark',
+    'CudaNcclBwBenchmark',
+    'CudnnBenchmark',
+    'DiskBenchmark',
+    'DistInference',
+    'HipBlasLtBenchmark',
+    'GPCNetBenchmark',
+    'GemmFlopsBenchmark',
+    'GpuBurnBenchmark',
+    'GpuCopyBwBenchmark',
+    'GpuStreamBenchmark',
+    'IBBenchmark',
+    'IBLoopbackBenchmark',
+    'KernelLaunch',
+    'MemBwBenchmark',
+    'MicroBenchmark',
+    'MicroBenchmarkWithInvoke',
+    'ORTInferenceBenchmark',
+    'RocmGemmFlopsBenchmark',
+    'RocmMemBwBenchmark',
+    'ShardingMatmul',
+    'TCPConnectivityBenchmark',
+    'TensorRTInferenceBenchmark',
+    'DirectXGPUEncodingLatency',
+    'DirectXGPUCopyBw',
+    'DirectXGPUMemBw',
+    'DirectXGPUCoreFlops',
+    'NvBandwidthBenchmark',
+    'NvbenchKernelLaunch',
+    'NvbenchSleepKernel',
```
```diff
-          DEBIAN_FRONTEND=noninteractive apt-get install -y ffmpeg libavcodec-dev libavformat-dev libavutil-dev libswresample-dev sudo
+          DEBIAN_FRONTEND=noninteractive apt-get install -y ffmpeg libavcodec-dev libavformat-dev libavutil-dev libswresample-dev sudo build-essential
+      - name: Setup CMake
+        uses: lukka/get-cmake@latest
```
Copilot AI, Jan 23, 2026
Using @latest for third-party GitHub Actions is a supply-chain risk and can lead to non-reproducible CI behavior. Pin this action to a specific tagged version or commit SHA.
Suggested change:

```diff
-        uses: lukka/get-cmake@latest
+        uses: lukka/get-cmake@v3.20.0
```
```makefile
	&& git -C msccl checkout 87048bd && git -C msccl submodule update --recursive --init
else ifeq ($(shell echo $(CUDA_VER)">=12.8" | bc -l), 1)
	# Get commit 87048bd from msscl to support updated nccl and sm_100
	# Get commit 87048bd from msscl to support updated nccl and sm_100
```
Copilot AI, Jan 23, 2026
Typo in comment: change msscl to msccl.
Suggested change:

```diff
-	# Get commit 87048bd from msscl to support updated nccl and sm_100
+	# Get commit 87048bd from msccl to support updated nccl and sm_100
```
This pull request adds support for NVBench-based GPU micro-benchmarks to SuperBench:

- nvbench-sleep-kernel
- nvbench-kernel-launch

Example config: