
Conversation


@pralay-das pralay-das commented Nov 10, 2025

In this PR:

  1. Change the interface of the chunk_prefill execution engine; it is now more compatible with the CUDA CUTLASS implementation.
  2. Add support for the cutlass_mla_get_workspace_size op.
  3. Verify the changes with the test file.

cmd: python -m pytest tests/test_flash_attention.py
result: 96 passed, 182 skipped, 1 warning in 3.43s

// TODO: these arguments still need to be populated to match the actual use case.

return MlaFlashAttnType::Kernel::get_workspace_size(arguments);
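For context, the usual CUTLASS 3.x pattern is: populate the kernel's Arguments struct, ask the static get_workspace_size for a byte count, and allocate that much scratch before launch. Below is a minimal sketch of that flow; the FakeKernel stand-in and all of its field names are assumptions for illustration, not the real MlaFlashAttnType instantiation (whose argument fields are exactly the part still marked TODO above).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the real CUTLASS MLA kernel traits. The actual
// MlaFlashAttnType::Kernel comes from the CUTLASS instantiation in this PR.
struct FakeKernel {
  struct Arguments {
    int batches = 1;
    int seq_len_kv = 0;
  };
  // CUTLASS 3.x kernels expose a static workspace query of this shape.
  static std::size_t get_workspace_size(Arguments const& /*args*/) {
    return 0;  // the stub under review currently also returns zero
  }
};

struct MlaFlashAttnType { using Kernel = FakeKernel; };

std::size_t cutlass_mla_get_workspace_size_sketch(int batches, int seq_len_kv) {
  MlaFlashAttnType::Kernel::Arguments arguments{};
  // TODO (as in the PR): fill these fields to match the actual use case.
  arguments.batches = batches;
  arguments.seq_len_kv = seq_len_kv;
  return MlaFlashAttnType::Kernel::get_workspace_size(arguments);
}

int main() {
  std::size_t bytes =
      cutlass_mla_get_workspace_size_sketch(/*batches=*/4, /*seq_len_kv=*/4096);
  // Callers would allocate this many bytes of device scratch before launch;
  // a host vector stands in for the device allocation here.
  std::vector<std::uint8_t> workspace(bytes);
  (void)workspace;
  return 0;
}
```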


Do you have specific workspace (WS) sizes for MLA? This is returning zero size. Or do we really need this API for our implementation?

@pralay-das (Author) replied:

Hi, I don't have any specific WS sizes right now that would work for both MLA and chunked_prefill.
I have seen that the test case calls this function before calling the MLA function, and simply returning zero works fine with the test case.


Right, let's come up with the right sizes for the different configurations.
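One possible shape for those sizes, purely as a sketch: split-KV style MLA kernels typically need scratch for per-split partial outputs plus a per-split log-sum-exp value that the final reduction rescales with. Everything below (the helper name, its parameters, and the formula) is a hypothetical starting point, not a decided design:

```cpp
#include <cstddef>

// Hypothetical sizing helper; the parameters and the formula are assumptions,
// not the kernel's actual requirements.
std::size_t mla_workspace_size_hint(int batches, int kv_splits,
                                    int num_heads, int head_dim_latent) {
  if (kv_splits <= 1) {
    return 0;  // single split: no cross-split reduction buffer is needed
  }
  // fp32 partial-output accumulators, one row per (batch, head, split).
  std::size_t accum = std::size_t(batches) * num_heads * kv_splits *
                      std::size_t(head_dim_latent) * sizeof(float);
  // One log-sum-exp scalar per (batch, head, split) for the final rescale.
  std::size_t lse =
      std::size_t(batches) * num_heads * std::size_t(kv_splits) * sizeof(float);
  return accum + lse;
}
```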

