Conversation


@adityachatter adityachatter commented Oct 17, 2025

  • Adds functional support for the FP8 Chunk Prefill kernel
  • Supports the FP8 E4M3FN and E5M2 datatypes. Expects Q, K, and V to be in FP8 precision and the descale factors for Q, K, and V to be in FP32 precision with shape (batch size, number of KV heads); see the sketch below.
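
A minimal sketch (assumed shapes and names, not the kernel's actual Python API) of the input dtypes and descale-factor shapes described above:

import torch

# Hypothetical sizes for illustration only.
batch_size, seqlen, num_heads, num_kv_heads, head_dim = 2, 128, 8, 2, 64

# Q, K, V are expected in FP8 (E4M3FN shown here; E5M2 is also supported).
q = torch.randn(batch_size, seqlen, num_heads, head_dim).to(torch.float8_e4m3fn)
k = torch.randn(batch_size, seqlen, num_kv_heads, head_dim).to(torch.float8_e4m3fn)
v = torch.randn(batch_size, seqlen, num_kv_heads, head_dim).to(torch.float8_e4m3fn)

# Descale factors are expected in FP32 with shape (batch size, number of KV heads).
q_descale = torch.ones(batch_size, num_kv_heads, dtype=torch.float32)
k_descale = torch.ones(batch_size, num_kv_heads, dtype=torch.float32)
v_descale = torch.ones(batch_size, num_kv_heads, dtype=torch.float32)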

Run FP8 Chunk Prefill unit tests:

cd sgl-kernel-xpu/tests
python3 -m pytest -v -s test_flash_attention.py -k dtype1
96 passed, 182 skipped, 278 deselected

@adityachatter adityachatter force-pushed the achatter/fp8_chunk_prefill branch from d20fff8 to 06ae0d8 Compare October 27, 2025 07:08
@adityachatter adityachatter marked this pull request as ready for review October 29, 2025 08:49
@deepvars deepvars self-requested a review November 6, 2025 04:31

@kareemshaik80 kareemshaik80 left a comment


LGTM

@pengzhao-intel
Collaborator

@adityachatter did you align with the framework team on the datatype of Q, K, V and the scale datatype?

@adityachatter
Author

@pengzhao-intel
Yes, we have confirmed the dtype requirements for Q, K, V and the scale factors with the framework team.

@pengzhao-intel
Collaborator

@pengzhao-intel Yes, we have confirmed the dtype requirements for Q, K, V and the scale factors with the framework team.

Is this PR only for CRI, or for both BMG and CRI?

@adityachatter
Author

Is this PR only for CRI, or for both BMG and CRI?

This is for BMG. Tested on a B580 GPU.

Comment on lines +46 to +65
CUTLASS_DEVICE uint16_t fp8_e4m3_to_bf16_bitwise(uint8_t const& src) {
  // E4M3 (1-4-3) constants
  constexpr uint32_t e4m3_exp_bias = 7;
  // BFLOAT16 (1-8-7) constants
  constexpr uint32_t bf16_exp_bias = 127;

  // Unpack FP8 bits
  uint16_t sign = static_cast<uint16_t>(src & 0x80);
  uint16_t exponent = static_cast<uint16_t>(src & 0x78) >> 3;
  uint16_t mantissa = static_cast<uint16_t>(src & 0x07);

  // Reconstruct BFLOAT16 bits
  uint16_t bf16_sign = sign << 8;
  // Re-bias exponent and shift to BFLOAT16 position
  uint16_t bf16_exponent = (exponent - e4m3_exp_bias + bf16_exp_bias) << 7;
  // Shift mantissa to BFLOAT16 position
  uint16_t bf16_mantissa = mantissa << 4;

  return bf16_sign | bf16_exponent | bf16_mantissa;
}
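
For reference, a small host-side worked example of the same bit manipulation (my own sketch, not part of the PR), checking that E4M3 0x40 (value 2.0) maps to the BF16 bit pattern 0x4000 (also 2.0):

# Host-side sketch of the same bit manipulation (not part of the PR).
def fp8_e4m3_to_bf16_bits(src: int) -> int:
    sign = src & 0x80                # 1 sign bit
    exponent = (src & 0x78) >> 3     # 4 exponent bits, bias 7
    mantissa = src & 0x07            # 3 mantissa bits
    # Re-bias to BF16 (bias 127) and move each field to its BF16 position.
    return (sign << 8) | ((exponent - 7 + 127) << 7) | (mantissa << 4)

# E4M3 0x40 encodes 2.0; the BF16 bit pattern 0x4000 also encodes 2.0.
assert fp8_e4m3_to_bf16_bits(0x40) == 0x4000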
Collaborator


Have you tried the inline asm from https://github.com/intel/sycl-tla/blob/887362d3e5b4b038a50d9cf11b0caeb64dec86e2/include/cute/arch/reorder_xe.hpp#L375 ? The scalar conversion here is inefficient

Author


We will include the asm reorder as part of moving FP8 support to the rearch in a later pull request.

Collaborator


With the new API, one line of code could serve the same purpose, saving a lot of reviewing and refactoring effort compared to this huge function.

Collaborator

@sunjiweiswift sunjiweiswift left a comment


After the chunk prefill kernel is refactored with the new API, you will need to adapt this FP8 support to use the new API.

@sunjiweiswift
Collaborator

How does performance compare vs. BF16?

@mingfeima mingfeima marked this pull request as draft November 11, 2025 02:53
@mingfeima
Collaborator

Is this PR only for CRI, or for both BMG and CRI?

This is for BMG. Tested on a B580 GPU.

Why are we doing Q in FP8 on BMG? That makes no sense.

@mingfeima
Collaborator

@adityachatter

  • For functional enabling: it is OK to skip performance tests, as long as the test-case coverage is good enough.
  • For performance optimization: it is essential to provide performance data to prove the improvements.

@guoyejun

descale factors for Q, K, V to be in FP32 precision with shape (batch size, number of KV heads)

Curious why we need different scales for different batch elements? How are the scales generated for each batch element?

@adityachatter
Author

Why are we doing Q in FP8 on BMG? That makes no sense.

The requirement for FP8 Q came from the framework team.

How does performance compare vs. BF16?

This PR is for functional FP8 support.
Optimized support is blocked on internally tracked issues and will be included in a later patch with the rearch.

Curious why we need different scales for different batch elements? How are the scales generated for each batch element?

Dynamic quantization is used to generate the scales for each batch.
During inference, each batch element may correspond to a different request (different input sequence length / padding), so activation magnitudes vary; quantizing per batch element results in lower quantization error.
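
A minimal sketch of such per-(batch, KV head) dynamic quantization (my own illustration with assumed shapes and helper names, not the PR's code):

import torch

FP8_E4M3_MAX = 448.0  # largest finite value of torch.float8_e4m3fn

def dynamic_quantize_kv(x: torch.Tensor):
    # x: (batch_size, seqlen, num_kv_heads, head_dim) in BF16/FP32.
    amax = x.abs().amax(dim=(1, 3)).clamp(min=1e-12)   # (batch_size, num_kv_heads)
    scale = FP8_E4M3_MAX / amax                        # per-(batch, KV head) quantization scale
    x_fp8 = (x * scale[:, None, :, None]).to(torch.float8_e4m3fn)
    descale = (1.0 / scale).float()                    # FP32 descale factors passed to the kernel
    return x_fp8, descale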

@adityachatter adityachatter marked this pull request as ready for review November 12, 2025 07:37
Collaborator

@mingfeima mingfeima left a comment


Q IS BFLOAT16!

@mingfeima mingfeima marked this pull request as draft November 12, 2025 07:43
@mingfeima
Collaborator

IN ORDER TO LAND THIS PR, YOU NEED TO PROVIDE PERFORMANCE DATA.

template <typename Encoding, int VectorizeSize = 8, typename SrcTensor, typename DstTensor>
CUTLASS_DEVICE void convert_and_descale(SrcTensor const& src, DstTensor& dst, float scale) {
  using SrcVec_u8 = sycl::vec<uint8_t, VectorizeSize>;
  using DstVec_u16 = sycl::vec<uint16_t, VectorizeSize>;


Are the dtypes of src and dst fixed as uint8_t and uint16_t? If yes, we could refine typename SrcTensor and typename DstTensor, which currently convey no information about uint8 and uint16.

result_vec_u16[j] = reinterpret_cast<uint16_t const&>(scaled_bf16);
}

// 5. Store the final vector of bits


// 5. Store as bits
// 5. Store the final vector of bits

There are two comments numbered 5.

  convert_and_descale<ElementQ>(tCrQ, tCrQ_bf16, q_scale);
} else {
  // If Q is already FP16, copy it.
  copy(tCrQ, tCrQ_bf16);


Is this duplicate work for the copy?

# batch_size = 2
# nheads = 1
nheads_kv = nheads if mha_type == "mha" else (2 if mha_type == "gqa" else 1)
dtype_ref = torch.bfloat16 if dtype == torch.float8_e4m3fn else dtype


The test test_flash_attn_varlen_output is invalid, as its skipif condition is always true.
