[POC] Minimal Kernel Autotuning Support #96

@mikepapadim

Summary

Add minimal kernel autotuning support to TornadoVM to automatically evaluate and select efficient execution configurations (e.g., work-group and grid sizes) at runtime, inspired by Triton’s triton.autotune.


Scope (Intentionally Small)

This sub-issue focuses on a first, minimal autotuning capability:

  • Support autotuning for work-group and grid dimensions only
  • Limit autotuning to single-kernel task graphs
  • Perform autotuning once per kernel per device
  • Cache the best configuration in-memory (no persistence required)

This is intended as a foundation for future extensions (tiling, memory layouts, heuristics).


Proposed Functionality

  1. Configuration Set

    • Allow a small, user-defined set of candidate execution configurations
    • Example: different (globalSize, localSize) combinations
  2. Runtime Benchmarking

    • On first execution, run each configuration and measure execution time
  3. Selection & Caching

    • Select the fastest configuration
    • Cache the result for subsequent executions on the same device
  4. Transparent Integration

    • The autotuned configuration replaces the default execution configuration without requiring user-side changes
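
The four steps above can be sketched in plain Java as follows. This is only an illustration under stated assumptions: `LaunchConfig`, `MinimalAutotuner`, and the `timedRun` callback are hypothetical names, not TornadoVM API, and the callback is assumed to launch the kernel with the given configuration and return the elapsed time in nanoseconds.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToLongFunction;

/** Hypothetical candidate execution configuration (2D); not a TornadoVM type. */
record LaunchConfig(int globalX, int globalY, int localX, int localY) { }

/** Hypothetical autotuner: benchmarks once per (kernel, device) and caches in memory. */
final class MinimalAutotuner {

    /** Cache key: one entry per kernel per device. */
    private record Key(String kernelId, String deviceId) { }

    private final Map<Key, LaunchConfig> best = new ConcurrentHashMap<>();

    /**
     * Returns the cached configuration for this kernel and device. On the first
     * call, runs every candidate once through timedRun and caches the fastest.
     */
    LaunchConfig select(String kernelId, String deviceId,
                        List<LaunchConfig> candidates,
                        ToLongFunction<LaunchConfig> timedRun) {
        if (candidates.isEmpty()) {
            throw new IllegalArgumentException("at least one candidate is required");
        }
        return best.computeIfAbsent(new Key(kernelId, deviceId), key -> {
            LaunchConfig fastest = null;
            long fastestNanos = Long.MAX_VALUE;
            for (LaunchConfig candidate : candidates) {
                long nanos = timedRun.applyAsLong(candidate); // assumed: elapsed time in ns
                if (fastest == null || nanos < fastestNanos) {
                    fastest = candidate;
                    fastestNanos = nanos;
                }
            }
            return fastest;
        });
    }
}
```

Keying the cache on both the kernel and the device mirrors the "once per kernel per device" scope: a second device triggers its own benchmarking pass, while repeated calls for the same pair skip benchmarking entirely.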

Motivation

Performance in TornadoVM is sensitive to execution parameters and GPU architecture.
Today, finding good configurations is manual and hardware-specific.

Even a limited autotuning mechanism would:

  • Reduce manual tuning effort
  • Improve out-of-the-box performance
  • Benefit GPU-heavy workloads (e.g., attention, GEMM) used in GPULlama3

Example (Hypothetical)

A kernel is executed with candidate local sizes:

  • (16,16)
  • (32,8)
  • (8,32)

TornadoVM benchmarks each candidate once, selects the fastest, and reuses it for all subsequent executions of that kernel on the same device.
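
A minimal sketch of how the three candidates above might be timed, assuming a blocking `launch` callback that runs the kernel with a given local size. `LocalSize` and `LocalSizeBenchmark` are hypothetical names for illustration; in TornadoVM the chosen size would presumably be applied through its worker-grid and grid-scheduler API, which is not shown here.

```java
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical 2D work-group (local) size; not a TornadoVM type. */
record LocalSize(int x, int y) { }

final class LocalSizeBenchmark {

    /** Runs the launch once and returns the elapsed wall-clock time in nanoseconds. */
    static long timeOnce(Consumer<LocalSize> launch, LocalSize size) {
        long start = System.nanoTime();
        launch.accept(size); // assumed to block until the kernel has finished
        return System.nanoTime() - start;
    }

    /** Benchmarks each candidate exactly once and returns the fastest one. */
    static LocalSize fastest(List<LocalSize> candidates, Consumer<LocalSize> launch) {
        if (candidates.isEmpty()) {
            throw new IllegalArgumentException("at least one candidate is required");
        }
        LocalSize best = null;
        long bestNanos = Long.MAX_VALUE;
        for (LocalSize size : candidates) {
            long nanos = timeOnce(launch, size);
            if (best == null || nanos < bestNanos) {
                best = size;
                bestNanos = nanos;
            }
        }
        return best;
    }
}
```

Calling `LocalSizeBenchmark.fastest(List.of(new LocalSize(16, 16), new LocalSize(32, 8), new LocalSize(8, 32)), launch)` returns whichever candidate ran fastest. Because each candidate is timed only once, a measurement may include one-time costs such as kernel compilation; a warm-up run before timing is one possible mitigation.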


Out of Scope (For This Sub-Issue)

  • Persistent caching across JVM runs
  • Large configuration search spaces
  • Compiler-driven autotune generation
  • Memory tiling or algorithmic variants

Expected Outcome

A small but functional autotuning mechanism that demonstrates feasibility and provides immediate performance benefits, serving as a stepping stone toward a full Triton-like autotuning framework.

Metadata

Labels

enhancement (New feature or request)
