feat: auto-select best attention backend when not specified #1185
base: main
Conversation
When users don't explicitly specify an attention backend (--llm_prefill_att_backend or --llm_decode_att_backend), LightLLM now automatically selects the best available backend with priority: FA3 > FlashInfer > Triton. This improves user experience by removing the need to manually configure backends while ensuring optimal performance based on available hardware and packages.
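As a rough sketch of the selection chain this implies (the helper names, backend string values, and simple import probes below are illustrative, not the PR's exact code, which also validates the backends at runtime):

```python
import importlib.util


def _is_fa3_available() -> bool:
    # Illustrative probe only: the real check would also confirm the GPU
    # generation (FA3 targets Hopper) and that the kernels actually run.
    return importlib.util.find_spec("flash_attn_interface") is not None


def _is_flashinfer_available() -> bool:
    # Illustrative probe only: the real check also does a small runtime validation.
    return importlib.util.find_spec("flashinfer") is not None


def _auto_select_backend() -> str:
    """Return the highest-priority usable attention backend: FA3 > FlashInfer > Triton."""
    if _is_fa3_available():
        return "fa3"
    if _is_flashinfer_available():
        return "flashinfer"
    # Triton kernels ship with LightLLM itself, so they are the safe fallback.
    return "triton"
```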
Summary of Changes
Hello @sufubao, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces an intelligent auto-selection mechanism for attention backends within LightLLM. Instead of requiring users to manually specify a backend, the system will now automatically detect and use the most performant available option, prioritizing FA3, then FlashInfer, and falling back to Triton. This change streamlines configuration and ensures that the application leverages the best available hardware acceleration for attention operations, enhancing overall efficiency and user experience.
Code Review
This pull request introduces an automatic selection mechanism for the attention backend, which is a great feature for improving performance out-of-the-box. The implementation correctly prioritizes backends (FA3 > FlashInfer > Triton) and includes availability and runtime validation checks. The changes to the CLI arguments are also correct. My review includes a couple of suggestions to make the validation logic more robust and efficient: one to improve error reporting in the availability check, and another to reduce memory usage during the runtime validation of the flashinfer backend. Overall, the changes are well-implemented.
```python
    except Exception:
        return False
```
The broad except Exception: can hide underlying issues and make debugging difficult. For instance, a CUDA error within is_hopper() would be silently ignored. It's better to catch specific exceptions or at least log the error for better diagnostics.
Suggested change:

```diff
-    except Exception:
-        return False
+    except Exception as e:
+        logger.debug(f"FA3 availability check failed: {e}")
+        return False
```
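Put together, the helper with the suggested logging could look something like the sketch below; the FA3 import path and the compute-capability test are stand-ins for whatever the PR actually uses (the review mentions an is_hopper() helper):

```python
import logging

import torch

logger = logging.getLogger(__name__)


def _fa3_available() -> bool:
    """Report whether FlashAttention-3 is usable, logging failures instead of hiding them."""
    try:
        import flash_attn_interface  # noqa: F401  # assumed FA3 import path

        major, _ = torch.cuda.get_device_capability()
        return major == 9  # stand-in for the PR's is_hopper() check (SM90 = Hopper)
    except Exception as e:
        # Still fail closed, but keep a trace for debugging rather than a silent False.
        logger.debug(f"FA3 availability check failed: {e}")
        return False
```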
```python
        import flashinfer  # noqa: F401

        # Try creating a minimal workspace buffer to verify flashinfer works
        _ = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
```
Allocating 128MB of CUDA memory just for a runtime validation check seems excessive and can cause an unnecessary memory spike. A much smaller allocation should be sufficient to verify that flashinfer can initialize correctly on the CUDA device.
Suggested change:

```diff
-        _ = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
+        _ = torch.empty(1 * 1024 * 1024, dtype=torch.uint8, device="cuda")
```
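Applied to the surrounding helper, the smaller probe might look roughly like this; the function name and the explicit cleanup are illustrative, and only the 1 MB allocation comes from the suggestion:

```python
import logging

import torch

logger = logging.getLogger(__name__)


def _flashinfer_runtime_ok() -> bool:
    """Check that flashinfer imports and that a small CUDA allocation succeeds."""
    try:
        import flashinfer  # noqa: F401

        # A 1 MB probe is enough to confirm the CUDA context is healthy; the real
        # workspace buffer is allocated later, when the backend is actually built.
        probe = torch.empty(1 * 1024 * 1024, dtype=torch.uint8, device="cuda")
        del probe
        return True
    except Exception as e:
        logger.debug(f"FlashInfer runtime validation failed: {e}")
        return False
```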
Summary
Changed the default of --llm_prefill_att_backend and --llm_decode_att_backend from triton to None (auto-select); a sketch of the wiring follows the test plan below.
Changes
Added an _auto_select_backend() function with helper functions to check FA3/FlashInfer availability and validate backends at runtime.
Test plan
Verified that explicitly specifying a backend (e.g. --llm_prefill_att_backend triton) still works as before.
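As a rough illustration of how the None default could hand off to auto-selection (the argument names come from the PR; the parser layout and the reuse of the _auto_select_backend() sketch above are assumptions):

```python
import argparse

parser = argparse.ArgumentParser()
# A default of None now means "not specified"; LightLLM then auto-selects a backend.
parser.add_argument("--llm_prefill_att_backend", type=str, default=None)
parser.add_argument("--llm_decode_att_backend", type=str, default=None)
args = parser.parse_args()

# Reusing the _auto_select_backend() sketch from earlier on this page.
if args.llm_prefill_att_backend is None:
    args.llm_prefill_att_backend = _auto_select_backend()
if args.llm_decode_att_backend is None:
    args.llm_decode_att_backend = _auto_select_backend()
```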