[Metal] Add Flash Attention VJP for training #2995

Brooooooklyn · 2026-01-14T03:01:43Z

Summary

Implements fused backward pass (VJP) for scaled_dot_product_attention on Metal GPU. This enables efficient gradient computation during training without falling back to unfused (decomposed) attention operations.

Changes

New Files

mlx/backend/metal/kernels/sdpa_vector_vjp.h - Vector VJP kernel for short sequences (L ≤ 8)
mlx/backend/metal/kernels/steel/attn/kernels/steel_attention_vjp_dq.h - STEEL dQ gradient kernel
mlx/backend/metal/kernels/steel/attn/kernels/steel_attention_vjp_dkv.h - STEEL dK/dV gradient kernel

Modified Files

mlx/backend/metal/scaled_dot_product_attention.cpp - VJP dispatch logic (+840 lines)
mlx/fast.cpp / mlx/fast_primitives.h - Logsumexp caching, VJP routing
python/tests/test_fast_sdpa.py - Comprehensive VJP tests (+220 lines)

Implementation Notes

Uses a two-kernel approach to avoid atomic operations:

dQ kernel (steel_attention_vjp_dq.h):
- Computes query gradients via outer loop over KV blocks
- Uses log2 domain for numerical stability
- Proper clamping to prevent overflow (exp2 arg clamped to [-88, 0])
dK/dV kernel (steel_attention_vjp_dkv.h):
- Uses K-row ownership model where each simdgroup owns exclusive rows
- Eliminates race conditions in GQA where multiple query heads share KV
- No atomic operations needed
Vector VJP (sdpa_vector_vjp.h):
- Optimized path for short sequences (L ≤ 8)
- Uses float32 accumulators for half/bfloat16 precision
- Shared memory reduction for efficiency

Key Features

Float32 accumulators for half/bfloat16 precision
Logsumexp caching from forward pass for VJP reuse
Proper GQA (grouped query attention) support
Causal mask support

Limitations

Falls back to unfused attention for mask/sinks gradients (per existing design)
Requires logsumexp from forward pass (training mode only)
Head dimension D=256 not supported in vector VJP (32KB threadgroup memory limit)

Test Plan

Existing test_sdpa_grad passes
New comprehensive VJP tests added:
- test_sdpa_grad_vector_path - short sequences (L=1,4,7,8)
- test_sdpa_grad_steel_path - longer sequences (L=16,32,128,256)
- test_sdpa_grad_head_dims - head dimensions (D=32,64,96,128)
- test_sdpa_grad_gqa - GQA configurations (4:1, 8:1, 16:1, MHA)
- test_sdpa_grad_dtypes - float16, bfloat16, float32
- test_sdpa_grad_edge_cases - L=1, non-power-of-2, large batch, qL≠kvL

All 21 SDPA tests pass (1 skipped for unrelated disabled feature).

Brooooooklyn · 2026-01-14T03:16:08Z

Notes: I'm working on https://github.com/mlx-node/mlx-node and trying to port some features in trl.
This pull request was generated by Claude Code. I am trying to reduce the computation and memory usage of GRPO training by utilizing the full flash attention feature.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

mlx/fast.cpp

mlx/backend/metal/kernels/steel/attn/kernels/steel_attention_vjp_dkv.h

mlx/backend/metal/kernels/sdpa_vector_vjp.h

mlx/backend/metal/scaled_dot_product_attention.cpp

mlx/backend/metal/kernels/sdpa_vector_vjp.h

mlx/backend/metal/scaled_dot_product_attention.cpp

mlx/backend/metal/kernels/steel/attn/kernels/steel_attention_vjp_dkv.h

mlx/fast_primitives.h

mlx/backend/metal/scaled_dot_product_attention.cpp

Implement fused backward pass (VJP) for scaled_dot_product_attention on Metal GPU, enabling efficient training without falling back to unfused attention. - **dQ Kernel** (steel_attention_vjp_dq.h): Computes query gradients - Outer loop over KV blocks, inner accumulation for dQ - Uses log2 domain for numerical stability - **dK/dV Kernel** (steel_attention_vjp_dkv.h): Computes key/value gradients - K-row ownership model eliminates atomic operations - Each simdgroup owns exclusive K rows to prevent races - Optimized path for short sequences (L ≤ 8) - Uses shared memory for efficient reduction - Float32 accumulators for half/bfloat16 precision - Logsumexp caching from forward pass - Proper GQA (grouped query attention) support - Causal mask support - Comprehensive test coverage for all code paths - No gradient support for mask or attention sinks (falls back to unfused) - Requires logsumexp from forward pass (training mode only) - Head dimension D=256 not supported in vector VJP (threadgroup memory) Co-Authored-By: Claude <[email protected]>

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot · 2026-01-14T09:16:52Z

mlx/backend/metal/scaled_dot_product_attention.cpp

+  if (use_vector_vjp) {
+    if (needs_float32_accumulators) {
+      // Use float32 accumulator buffers with the accumulate kernel variant
+      // This variant has device float* signature for dK/dV, ensuring correct
+      // pointer arithmetic (sizeof(float)=4) instead of sizeof(T)=2 for
+      // half/bfloat16
+      array& dk_acc = dk_accum.value();
+      array& dv_acc = dv_accum.value();
+      sdpa_vector_vjp_accumulate_dispatch(
+          s,
+          d,
+          q,
+          k,
+          v,
+          o,
+          dO,
+          lse,
+          d_q,
+          dk_acc,
+          dv_acc,
+          scale_,
+          do_causal,
+          mask,
+          sinks);
+
+      // Convert float32 accumulators to original dtype
+      // This uses the standard copy primitive with type conversion
+      copy_gpu(dk_acc, d_k, CopyType::General, s);
+      copy_gpu(dv_acc, d_v, CopyType::General, s);
+
+      // Add accumulators as temporaries for cleanup
+      d.add_temporary(dk_acc, s.index);
+      d.add_temporary(dv_acc, s.index);
+    } else {
+      // Float32: pass dK/dV directly (already zero-initialized above)
+      sdpa_vector_vjp_dispatch(
+          s,
+          d,
+          q,
+          k,
+          v,
+          o,
+          dO,
+          lse,
+          d_q,
+          d_k,
+          d_v,
+          scale_,
+          do_causal,
+          mask,
+          sinks);
+    }
+  } else {
+    // Two-kernel STEEL VJP approach for longer sequences
+    // This eliminates atomic operations by using separate kernels for dQ and
+    // dK/dV:
+    // - dQ kernel: loops over KV blocks (grid [NQ, H, B])
+    // - dKV kernel: loops over Q blocks (grid [NK, H, B])
+    // Each kernel owns its output entirely, avoiding race conditions.
+
+    // STEEL VJP kernels (do not support masks/sinks)
+    // Safety assertion: use_fallback() should prevent reaching here with
+    // masks/sinks
+    assert(
+        (!has_arr_mask && !has_sinks_) &&
+        "STEEL VJP called with mask/sinks - use_fallback() should have prevented this");
+
+    // Runtime guard (active even in release builds where assert is disabled)
+    if (has_arr_mask || has_sinks_) {
+      throw std::runtime_error(
+          "Internal error: STEEL VJP called with masks/sinks. "
+          "This indicates a bug in use_fallback() logic.");
+    }
+
+    // Dispatch dQ kernel - computes only dQ gradients
+    sdpa_steel_vjp_dq_dispatch(
+        s, d, q, k, v, o, dO, lse, d_q, scale_, do_causal);
+
+    // Dispatch dKV kernel - computes dK and dV gradients directly
+    sdpa_steel_vjp_dkv_dispatch(
+        s, d, q, k, v, o, dO, lse, d_k, d_v, scale_, do_causal);
+  }
+


The variable use_vector_vjp is hardcoded to false at line 1468, which effectively disables the vector VJP path entirely. The extensive code for vector VJP (lines 1485-1587) becomes dead code. Either remove the dead code or use a proper condition to enable vector VJP when appropriate. The comment explains why it's disabled (logsumexp domain mismatch), but having 100+ lines of unreachable code is a maintainability issue.

Suggested change

if (use_vector_vjp) {

if (needs_float32_accumulators) {

// Use float32 accumulator buffers with the accumulate kernel variant

// This variant has device float* signature for dK/dV, ensuring correct

// pointer arithmetic (sizeof(float)=4) instead of sizeof(T)=2 for

// half/bfloat16

array& dk_acc = dk_accum.value();

array& dv_acc = dv_accum.value();

sdpa_vector_vjp_accumulate_dispatch(

s,

d,

q,

k,

v,

o,

dO,

lse,

d_q,

dk_acc,

dv_acc,

scale_,

do_causal,

mask,

sinks);

// Convert float32 accumulators to original dtype

// This uses the standard copy primitive with type conversion

copy_gpu(dk_acc, d_k, CopyType::General, s);

copy_gpu(dv_acc, d_v, CopyType::General, s);

// Add accumulators as temporaries for cleanup

d.add_temporary(dk_acc, s.index);

d.add_temporary(dv_acc, s.index);

} else {

// Float32: pass dK/dV directly (already zero-initialized above)

sdpa_vector_vjp_dispatch(

s,

d,

q,

k,

v,

o,

dO,

lse,

d_q,

d_k,

d_v,

scale_,

do_causal,

mask,

sinks);

}

} else {

// Two-kernel STEEL VJP approach for longer sequences

// This eliminates atomic operations by using separate kernels for dQ and

// dK/dV:

// - dQ kernel: loops over KV blocks (grid [NQ, H, B])

// - dKV kernel: loops over Q blocks (grid [NK, H, B])

// Each kernel owns its output entirely, avoiding race conditions.

// STEEL VJP kernels (do not support masks/sinks)

// Safety assertion: use_fallback() should prevent reaching here with

// masks/sinks

assert(

(!has_arr_mask && !has_sinks_) &&

"STEEL VJP called with mask/sinks - use_fallback() should have prevented this");

// Runtime guard (active even in release builds where assert is disabled)

if (has_arr_mask || has_sinks_) {

throw std::runtime_error(

"Internal error: STEEL VJP called with masks/sinks. "

"This indicates a bug in use_fallback() logic.");

}

// Dispatch dQ kernel - computes only dQ gradients

sdpa_steel_vjp_dq_dispatch(

s, d, q, k, v, o, dO, lse, d_q, scale_, do_causal);

// Dispatch dKV kernel - computes dK and dV gradients directly

sdpa_steel_vjp_dkv_dispatch(

s, d, q, k, v, o, dO, lse, d_k, d_v, scale_, do_causal);

}

// Two-kernel STEEL VJP approach for longer sequences

// This eliminates atomic operations by using separate kernels for dQ and

// dK/dV:

// - dQ kernel: loops over KV blocks (grid [NQ, H, B])

// - dKV kernel: loops over Q blocks (grid [NK, H, B])

// Each kernel owns its output entirely, avoiding race conditions.

// STEEL VJP kernels (do not support masks/sinks)

// Safety assertion: use_fallback() should prevent reaching here with

// masks/sinks

assert(

(!has_arr_mask && !has_sinks_) &&

"STEEL VJP called with mask/sinks - use_fallback() should have prevented this");

// Runtime guard (active even in release builds where assert is disabled)

if (has_arr_mask || has_sinks_) {

throw std::runtime_error(

"Internal error: STEEL VJP called with masks/sinks. "

"This indicates a bug in use_fallback() logic.");

}

// Dispatch dQ kernel - computes only dQ gradients

sdpa_steel_vjp_dq_dispatch(

s, d, q, k, v, o, dO, lse, d_q, scale_, do_causal);

// Dispatch dKV kernel - computes dK and dV gradients directly

sdpa_steel_vjp_dkv_dispatch(

s, d, q, k, v, o, dO, lse, d_k, d_v, scale_, do_causal);

Copilot AI review requested due to automatic review settings January 14, 2026 03:01

Copilot started reviewing on behalf of Brooooooklyn January 14, 2026 03:02 View session

Copilot AI reviewed Jan 14, 2026

View reviewed changes

Brooooooklyn marked this pull request as draft January 14, 2026 04:27

Brooooooklyn force-pushed the flash-attn branch 4 times, most recently from 35a7886 to 568ff36 Compare January 14, 2026 08:18

Brooooooklyn force-pushed the flash-attn branch from 568ff36 to 26b5857 Compare January 14, 2026 09:03

Brooooooklyn marked this pull request as ready for review January 14, 2026 09:06

Brooooooklyn requested a review from Copilot January 14, 2026 09:12

Copilot started reviewing on behalf of Brooooooklyn January 14, 2026 09:12 View session

Copilot AI reviewed Jan 14, 2026

View reviewed changes

Brooooooklyn marked this pull request as draft January 14, 2026 14:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Metal] Add Flash Attention VJP for training #2995

[Metal] Add Flash Attention VJP for training #2995

Brooooooklyn commented Jan 14, 2026

Uh oh!

Brooooooklyn commented Jan 14, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Jan 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

[Metal] Add Flash Attention VJP for training #2995

Are you sure you want to change the base?

[Metal] Add Flash Attention VJP for training #2995

Conversation

Brooooooklyn commented Jan 14, 2026

Summary

Changes

New Files

Modified Files

Implementation Notes

Key Features

Limitations

Test Plan

Uh oh!

Brooooooklyn commented Jan 14, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 14, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant