
Conversation

aoxy commented Aug 18, 2025

Fixes #1802 (Any plans to backport additive attention sinks to flash-attn-2?)

Description

This pull request introduces support for the Sink Attention mechanism (directly following the implementation in GPT-OSS).

This is implemented by optionally incorporating a new learnable_sink parameter into the attention functions. This parameter is a tensor of shape (nheads,), providing a learnable bias for each attention head. The sink value is added as an extra logit in the attention score calculation before softmax, allowing the model to learn to "sink" a portion of the attention to a global entry, enhancing model capacity when handling long sequences.
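
For reference, here is a minimal, non-fused PyTorch sketch of this mechanism. Everything except the learnable_sink name and the extra-logit idea is an illustrative assumption (in particular the (batch, seqlen, nheads, headdim) layout); it is not the kernel code in this PR.

```python
import torch
import torch.nn.functional as F

def attention_with_sink_ref(q, k, v, learnable_sink, softmax_scale=None):
    # q, k, v: (batch, seqlen, nheads, headdim); learnable_sink: (nheads,)
    if softmax_scale is None:
        softmax_scale = q.shape[-1] ** -0.5
    # Standard attention scores.
    scores = torch.einsum("bqhd,bkhd->bhqk", q, k) * softmax_scale
    # Append one per-head sink logit as an extra "global" column.
    sink = learnable_sink.view(1, -1, 1, 1).expand(
        scores.shape[0], -1, scores.shape[2], 1
    )
    probs = F.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    # The sink column absorbs probability mass but contributes no value,
    # so it is dropped before multiplying with V.
    return torch.einsum("bhqk,bkhd->bqhd", probs[..., :-1], v)
```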

Key Changes:

  • Python Interface (flash_attn_interface.py):
    Added the learnable_sink parameter to all major forward functions (flash_attn_func, flash_attn_qkvpacked_func, etc.), including their variable-length counterparts. The backward pass is updated to compute and return the gradient for learnable_sink (a usage sketch follows this list).
  • C++/CUDA Kernels (csrc/):
    The C++ API and CUDA kernels now accept and integrate the sink values, incorporating them into the softmax computation during the forward pass, and calculating their gradients during the backward pass. A new template parameter Has_sink is introduced to conditionally compile the sink-related logic.
  • Testing (tests/test_flash_attn.py):
    Comprehensive tests, including new test cases with has_learnable_sink=True, have been added to validate both forward and backward passes. The numerical correctness of outputs and gradients is verified against reference implementations. All tests pass.
  • Benchmarks (benchmarks/benchmark_flash_attention.py):
    The benchmark script is updated to add a "Flash2Sink" method for measuring the feature's performance impact.
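
A hypothetical usage sketch of the updated Python interface: the learnable_sink keyword comes from this PR and its (nheads,) shape matches the description above, while the dtypes, sizes, and other arguments shown are assumptions, not the final API.

```python
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 1024, 8, 64
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda",
                dtype=torch.bfloat16, requires_grad=True)
k = torch.randn_like(q, requires_grad=True)
v = torch.randn_like(q, requires_grad=True)
# One learnable sink logit per attention head.
learnable_sink = torch.zeros(nheads, device="cuda", dtype=torch.bfloat16,
                             requires_grad=True)

out = flash_attn_func(q, k, v, causal=True, learnable_sink=learnable_sink)
out.sum().backward()  # the backward pass also populates learnable_sink.grad
```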

aoxy (Author) commented Aug 18, 2025

My current implementation approach is to add a new top-level interface flash_attn_sink_func, which results in a lot of redundant code. Would you consider adding the sink parameter to the existing interfaces instead?

tridao (Member) commented Aug 18, 2025

It's better to add it to the existing interface instead of duplicating code.

aoxy (Author) commented Aug 22, 2025

Hi @tridao, I have updated the PR to address your feedback. Please take another look when you have time. Thanks again for your guidance!

aoxy (Author) commented Aug 22, 2025

I plan to add sink support to the Hopper version in a follow-up PR.

aoxy force-pushed the feature/attention_with_sink branch from e34d3ad to c00f806 on August 22, 2025 03:50
gunjunlee commented Aug 23, 2025

It seems flash_attn_with_kvcache with sink produces incorrect results during decoding. Could you check it?

guilhermeleobas commented:

Hi @aoxy, will this work also be ported to FlashAttention-3 (the code in the hopper/ subdir)?

aoxy (Author) commented Sep 10, 2025

@guilhermeleobas, yes, I also plan to port this work to FlashAttention-3.

aoxy (Author) commented Sep 19, 2025

Hi @tridao,

Could you please help review this PR?

All tests pass. Would appreciate your feedback. Thank you!

Potatooff commented:

Please

aoxy (Author) commented Oct 21, 2025

Hi @tridao,

Sorry to disturb you, but are there any updates on the review of this PR?

Also, are you still considering integrating Sink Attention into FlashAttention-2?

Thank you very much!

liuqianchao commented:

Hi @aoxy, any update on the merge work?

We’ve recently run into low training efficiency when doing RL training with gpt-oss because sink attention support is inconsistent between training and inference, and we’d like to know when FA2/FA3 are expected to officially support sink attention.

For more information, you can have a look at volcengine/verl#3794

aoxy (Author) commented Dec 2, 2025


I don't know; you could consider using an internal version.
