
Conversation

@dudilester

No description provided.

github-actions bot commented Nov 6, 2025

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds support for dynamic vLLM kv-cache quantization by introducing scaling tensors alongside the existing key and value cache tensors throughout the codebase.

Key changes:

  • Extended KV cache tuple structure from 2 elements (key, value) to 4 elements (key, value, key_scale, value_scale)
  • Added conditional logic to create scale tensors when FP8 quantization with dynamic scaling is enabled
  • Updated all cache operations (copy, swap, fetch) to handle the new scale tensors
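
For illustration, here is a minimal sketch of the allocation change described above. The function name, shapes, and the `with_scales` flag are assumptions for the sketch, not the PR's actual identifiers; the real layout is backend-specific.

```python
import torch

def allocate_kv_cache(num_blocks: int, block_size: int, num_heads: int,
                      head_size: int, dtype: torch.dtype, device: str,
                      with_scales: bool):
    """Illustrative per-layer KV cache allocation."""
    shape = (num_blocks, block_size, num_heads, head_size)
    key_cache = torch.zeros(shape, dtype=dtype, device=device)
    value_cache = torch.zeros(shape, dtype=dtype, device=device)
    if not with_scales:
        # Pre-PR layout: a 2-tuple of (key, value).
        return (key_cache, value_cache)
    # Post-PR layout: scale tensors ride alongside the caches so that
    # FP8 entries can be dequantized with dynamically computed scales.
    scale_shape = (num_blocks, block_size, num_heads, 1)
    key_scale = torch.ones(scale_shape, dtype=torch.float32, device=device)
    value_scale = torch.ones(scale_shape, dtype=torch.float32, device=device)
    return (key_cache, value_cache, key_scale, value_scale)
```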

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| vllm_gaudi/v1/worker/hpu_worker.py | Added scale tensor initialization to empty cache placeholders |
| vllm_gaudi/v1/worker/hpu_model_runner.py | Added conditional creation of scale tensors based on dtype and the QUANT_CONFIG environment variable (see the sketch after this table) |
| vllm_gaudi/extension/utils.py | Updated cache operations to accept optional scale parameters and modernized super() calls |
| vllm_gaudi/extension/unified.py | Extended CacheUtils to store and manage scale tensors |
| vllm_gaudi/extension/ops.py | Updated flat_pa to unflatten and pass scale tensors to fetch functions |
| vllm_gaudi/extension/cache_ops.py | Modified copy_blocks to handle scale tensors with null checks |
| vllm_gaudi/attention/ops/hpu_paged_attn.py | Updated method signatures and implementations to handle the 4-tuple cache structure |
| vllm_gaudi/attention/backends/hpu_attn.py | Threaded scale tensors through attention forward passes and helper methods |
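
The gating condition in hpu_model_runner.py presumably looks something like the following sketch. The exact FP8 dtype check and how the QUANT_CONFIG contents are interpreted are assumptions; only the two inputs (cache dtype and the QUANT_CONFIG environment variable) are named in the PR summary.

```python
import os
import torch

def needs_dynamic_scales(kv_cache_dtype: torch.dtype) -> bool:
    # Scale tensors are only created when the KV cache is stored in FP8
    # and an INC quantization config is supplied via QUANT_CONFIG.
    # Whether the config must additionally request dynamic scaling is
    # not shown here.
    return (kv_cache_dtype == torch.float8_e4m3fn
            and os.environ.get("QUANT_CONFIG") is not None)
```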


github-actions bot commented Nov 9, 2025

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

dudilester force-pushed the dev/dudilester/dynamic_kv branch from f3889f4 to f9b8994 on November 9, 2025 at 10:31
github-actions bot commented Nov 9, 2025

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

dudilester force-pushed the dev/dudilester/dynamic_kv branch from 1c66b48 to 11a46aa on November 9, 2025 at 10:37
github-actions bot commented Nov 9, 2025

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

dudilester force-pushed the dev/dudilester/dynamic_kv branch from 38a7d10 to fb65d49 on November 12, 2025 at 09:29
github-actions bot commented Nov 12, 2025

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

dudilester force-pushed the dev/dudilester/dynamic_kv branch from 7d7c612 to 03b767c on November 12, 2025 at 13:09
github-actions bot commented Nov 12, 2025

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

```diff
 return cache

-def fetch_from_cache(self, cache, blocks):
+def fetch_from_cache(self, cache, blocks, scales=None):
```
@michalkuligowski (Collaborator) commented:

e2e tests fail with:

PatchedVLLMKVCache.fetch_from_cache() takes 3 positional arguments but 4 were given
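
For context, this is a plain Python arity error: the INC-side wrapper still exposes the pre-PR two-argument signature, while the updated caller now passes scales as a third positional argument. A minimal, self-contained reproduction (the method body and arguments are placeholders, not the real implementation):

```python
class PatchedVLLMKVCache:
    # INC-side wrapper still exposing the pre-PR two-argument signature.
    def fetch_from_cache(self, cache, blocks):
        return cache  # placeholder body

patched = PatchedVLLMKVCache()
try:
    # The updated vllm-gaudi caller now passes scales positionally:
    patched.fetch_from_cache(("key", "value"), [0, 1], None)
except TypeError as err:
    print(err)
    # -> PatchedVLLMKVCache.fetch_from_cache() takes 3 positional
    #    arguments but 4 were given
```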

@dudilester (Author) commented:

Thanks @michalkuligowski, there is a related PR in INC to support this feature:
https://github.com/habana-internal/neural-compressor-fork/pull/322
I guess I need the INC PR to be merged first to make this PR pass the tests.
