[SW-228042] Add support for dynamic vLLM kv-cache quantization #538
base: main
Conversation
🚧 CI Blocked: the main CI workflow was not started.
Pull Request Overview
This PR adds support for dynamic vLLM kv-cache quantization by introducing scaling tensors alongside the existing key and value cache tensors throughout the codebase.
Key changes:
- Extended the KV cache tuple structure from 2 elements (key, value) to 4 elements (key, value, key_scale, value_scale); see the sketch after this list
- Added conditional logic to create scale tensors when FP8 quantization with dynamic scaling is enabled
- Updated all cache operations (copy, swap, fetch) to handle the new scale tensors
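A minimal sketch of what that 4-tuple allocation might look like, assuming hypothetical shapes, helper names, and the FP8/QUANT_CONFIG condition described above (the real vllm_gaudi code may differ):

```python
import os
import torch


def allocate_kv_cache_entry(num_blocks, block_size, num_heads, head_size,
                            cache_dtype, device="cpu"):
    """Hypothetical sketch: allocate one layer's KV cache as a 4-tuple.

    The scale tensors are only materialized when FP8 quantization with
    dynamic scaling is enabled; otherwise they stay None so existing
    non-quantized paths keep working unchanged.
    """
    shape = (num_blocks, block_size, num_heads, head_size)
    key = torch.zeros(shape, dtype=cache_dtype, device=device)
    value = torch.zeros(shape, dtype=cache_dtype, device=device)

    key_scale = value_scale = None
    # Assumed condition: FP8 cache dtype plus a QUANT_CONFIG environment
    # variable requesting dynamic scaling (names are illustrative).
    if cache_dtype == torch.float8_e4m3fn and os.environ.get("QUANT_CONFIG"):
        scale_shape = (num_blocks, block_size, num_heads, 1)
        key_scale = torch.ones(scale_shape, dtype=torch.bfloat16, device=device)
        value_scale = torch.ones(scale_shape, dtype=torch.bfloat16, device=device)

    return key, value, key_scale, value_scale
```

Downstream consumers would then unpack `key, value, key_scale, value_scale` and thread the optional scales through the copy/swap/fetch helpers, as the per-file summary below describes.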
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| vllm_gaudi/v1/worker/hpu_worker.py | Added scale tensor initialization to empty cache placeholders |
| vllm_gaudi/v1/worker/hpu_model_runner.py | Added conditional creation of scale tensors based on dtype and QUANT_CONFIG environment variable |
| vllm_gaudi/extension/utils.py | Updated cache operations to accept optional scale parameters and modernized super() calls |
| vllm_gaudi/extension/unified.py | Extended CacheUtils to store and manage scale tensors |
| vllm_gaudi/extension/ops.py | Updated flat_pa to unflatten and pass scale tensors to fetch functions |
| vllm_gaudi/extension/cache_ops.py | Modified copy_blocks to handle scale tensors with null checks |
| vllm_gaudi/attention/ops/hpu_paged_attn.py | Updated method signatures and implementations to handle 4-tuple cache structure |
| vllm_gaudi/attention/backends/hpu_attn.py | Threaded scale tensors through attention forward passes and helper methods |
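As an illustration of the null-check pattern mentioned for copy_blocks, here is a hedged sketch with assumed tensor layouts (not the actual vllm_gaudi/extension/cache_ops.py code):

```python
import torch


def copy_blocks(kv_caches, block_mapping):
    """Hypothetical sketch: copy cache blocks, including optional scales.

    kv_caches: list of (key, value, key_scale, value_scale) tuples per layer,
               where the scale entries may be None when quantization is off.
    block_mapping: iterable of (src_block, dst_block) index pairs.
    """
    for key, value, key_scale, value_scale in kv_caches:
        for src, dst in block_mapping:
            key[dst].copy_(key[src])
            value[dst].copy_(value[src])
            # Null checks: scale tensors exist only with dynamic FP8 scaling.
            if key_scale is not None:
                key_scale[dst].copy_(key_scale[src])
            if value_scale is not None:
                value_scale[dst].copy_(value_scale[src])
```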
Force-pushed f3889f4 to f9b8994
Force-pushed 1c66b48 to 11a46aa
Force-pushed 38a7d10 to fb65d49
Force-pushed 7d7c612 to 03b767c
```diff
 return cache

-def fetch_from_cache(self, cache, blocks):
+def fetch_from_cache(self, cache, blocks, scales=None):
```
e2e tests fail with:
PatchedVLLMKVCache.fetch_from_cache() takes 3 positional arguments but 4 were given
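For context, this is a standard Python signature mismatch: the PR now passes the scales tensor positionally, while INC's PatchedVLLMKVCache wrapper still declares the old two-argument fetch_from_cache. A hypothetical stand-in reproduces the same TypeError (not the actual INC code):

```python
import torch


class OldStyleKVCacheWrapper:
    """Stand-in for a wrapper whose signature predates the scales argument."""

    def fetch_from_cache(self, cache, blocks):  # 3 positionals incl. self
        return cache[blocks]


cache = torch.zeros(4, 8)
blocks = torch.tensor([0, 2])
scales = torch.ones(4, 1)

wrapper = OldStyleKVCacheWrapper()
try:
    # Callers updated by this PR now pass scales as a third argument:
    wrapper.fetch_from_cache(cache, blocks, scales)
except TypeError as exc:
    print(exc)  # ... takes 3 positional arguments but 4 were given
```

Accepting an optional scales argument on the INC side (presumably what the linked INC PR does) would resolve the mismatch while keeping older callers working.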
Thanks @michalkuligowski, there is a related PR in INC to support this feature:
https://github.com/habana-internal/neural-compressor-fork/pull/322
I guess the INC PR needs to be merged first for this PR to pass the tests.
No description provided.