Description
Summary
Add minimal kernel autotuning support to TornadoVM to automatically evaluate and select efficient execution configurations (e.g., work-group and grid sizes) at runtime, inspired by Triton’s triton.autotune.
Scope (Intentionally Small)
This sub-issue focuses on a first, minimal autotuning capability:
- Support autotuning for work-group and grid dimensions only
- Limit autotuning to single-kernel task graphs
- Perform autotuning once per kernel per device
- Cache the best configuration in-memory (no persistence required)
This is intended as a foundation for future extensions (tiling, memory layouts, heuristics).
Proposed Functionality
- Configuration Set
  - Allow a small, user-defined set of candidate execution configurations
  - Example: different (globalSize, localSize) combinations
- Runtime Benchmarking
  - On first execution, run each configuration and measure execution time
- Selection & Caching
  - Select the fastest configuration
  - Cache the result for subsequent executions on the same device
- Transparent Integration
  - Autotuned configuration replaces the default execution without user-side changes
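The flow above (candidate set, benchmark on first execution, select, cache per kernel and device) could be sketched roughly as follows. This is a minimal illustration in plain Java, not TornadoVM code: `LaunchConfig`, `KernelAutotuner`, `select`, and the `measureNanos` callback are all hypothetical names, and the callback is assumed to launch the kernel with the given configuration and return the elapsed time in nanoseconds.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToLongFunction;

/** Hypothetical candidate: one (globalSize, localSize) pair for a 2D launch. */
record LaunchConfig(int globalX, int globalY, int localX, int localY) {}

/** Hypothetical autotuner sketch; not part of the TornadoVM API. */
final class KernelAutotuner {
    // In-memory cache only, keyed by "kernel@device"; nothing is persisted.
    private final Map<String, LaunchConfig> cache = new ConcurrentHashMap<>();

    /**
     * Returns the cached best configuration for this kernel and device.
     * On the first call for a given key, every candidate is measured once
     * through measureNanos and the one with the lowest time is cached.
     */
    LaunchConfig select(String kernel, String device,
                        List<LaunchConfig> candidates,
                        ToLongFunction<LaunchConfig> measureNanos) {
        if (candidates.isEmpty()) {
            throw new IllegalArgumentException("at least one candidate is required");
        }
        return cache.computeIfAbsent(kernel + "@" + device, key -> {
            LaunchConfig best = null;
            long bestNanos = Long.MAX_VALUE;
            for (LaunchConfig candidate : candidates) {
                long elapsed = measureNanos.applyAsLong(candidate);
                if (best == null || elapsed < bestNanos) {
                    best = candidate;
                    bestNanos = elapsed;
                }
            }
            return best;
        });
    }
}
```

In a real integration, the measurement callback would presumably apply the candidate sizes through TornadoVM's own grid configuration and time the kernel on the device; that wiring is deliberately left out here. Passing the measurement in as a function keeps the selection and caching logic independent of how timing is obtained, which also makes it easy to check with synthetic timings.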
Motivation
Performance in TornadoVM is sensitive to execution parameters and GPU architecture.
Today, finding good configurations is manual and hardware-specific.
Even a limited autotuning mechanism would:
- Reduce manual tuning effort
- Improve out-of-the-box performance
- Benefit GPU-heavy workloads (e.g., attention, GEMM) used in GPULlama3
Example (Hypothetical)
A kernel is executed with candidate local sizes:
(16,16), (32,8), (8,32)
TornadoVM benchmarks each candidate once on first execution, selects the fastest, and reuses it for all subsequent executions on the same device.
Out of Scope (For This Sub-Issue)
- Persistent caching across JVM runs
- Large configuration search spaces
- Compiler-driven autotune generation
- Memory tiling or algorithmic variants
Expected Outcome
A small but functional autotuning mechanism that demonstrates feasibility and provides immediate performance benefits, serving as a stepping stone toward a full Triton-like autotuning framework.