[SW-228042] Add support for dynamic vLLM kv-cache quantization #538
base: main
Conversation
🚧 CI Blocked: the main CI workflow was not started.
Pull Request Overview
This PR adds support for dynamic vLLM kv-cache quantization by introducing scaling tensors alongside the existing key and value cache tensors throughout the codebase.
Key changes:
- Extended the KV cache tuple structure from 2 elements (key, value) to 4 elements (key, value, key_scale, value_scale); see the sketch after this list
- Added conditional logic to create scale tensors when FP8 quantization with dynamic scaling is enabled
- Updated all cache operations (copy, swap, fetch) to handle the new scale tensors
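A minimal sketch of what that 4-tuple allocation might look like, assuming hypothetical shapes, helper names, and the FP8/QUANT_CONFIG condition described above (the real vllm_gaudi code may differ):

```python
import os
import torch


def allocate_kv_cache_entry(num_blocks, block_size, num_heads, head_size,
                            cache_dtype, device="cpu"):
    """Hypothetical sketch: allocate one layer's KV cache as a 4-tuple.

    The scale tensors are only materialized when FP8 quantization with
    dynamic scaling is enabled; otherwise they stay None so existing
    non-quantized paths keep working unchanged.
    """
    shape = (num_blocks, block_size, num_heads, head_size)
    key = torch.zeros(shape, dtype=cache_dtype, device=device)
    value = torch.zeros(shape, dtype=cache_dtype, device=device)

    key_scale = value_scale = None
    # Assumed condition: FP8 cache dtype plus a QUANT_CONFIG environment
    # variable requesting dynamic scaling (names are illustrative).
    if cache_dtype == torch.float8_e4m3fn and os.environ.get("QUANT_CONFIG"):
        scale_shape = (num_blocks, block_size, num_heads, 1)
        key_scale = torch.ones(scale_shape, dtype=torch.bfloat16, device=device)
        value_scale = torch.ones(scale_shape, dtype=torch.bfloat16, device=device)

    return key, value, key_scale, value_scale
```

Downstream consumers would then unpack `key, value, key_scale, value_scale` and thread the optional scales through the copy/swap/fetch helpers, as the per-file summary below describes.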
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| vllm_gaudi/v1/worker/hpu_worker.py | Added scale tensor initialization to empty cache placeholders |
| vllm_gaudi/v1/worker/hpu_model_runner.py | Added conditional creation of scale tensors based on dtype and QUANT_CONFIG environment variable |
| vllm_gaudi/extension/utils.py | Updated cache operations to accept optional scale parameters and modernized super() calls |
| vllm_gaudi/extension/unified.py | Extended CacheUtils to store and manage scale tensors |
| vllm_gaudi/extension/ops.py | Updated flat_pa to unflatten and pass scale tensors to fetch functions |
| vllm_gaudi/extension/cache_ops.py | Modified copy_blocks to handle scale tensors with null checks |
| vllm_gaudi/attention/ops/hpu_paged_attn.py | Updated method signatures and implementations to handle 4-tuple cache structure |
| vllm_gaudi/attention/backends/hpu_attn.py | Threaded scale tensors through attention forward passes and helper methods |
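As an illustration of the null-check pattern mentioned for copy_blocks, here is a hedged sketch with assumed tensor layouts (not the actual vllm_gaudi/extension/cache_ops.py code):

```python
import torch


def copy_blocks(kv_caches, block_mapping):
    """Hypothetical sketch: copy cache blocks, including optional scales.

    kv_caches: list of (key, value, key_scale, value_scale) tuples per layer,
               where the scale entries may be None when quantization is off.
    block_mapping: iterable of (src_block, dst_block) index pairs.
    """
    for key, value, key_scale, value_scale in kv_caches:
        for src, dst in block_mapping:
            key[dst].copy_(key[src])
            value[dst].copy_(value[src])
            # Null checks: scale tensors exist only with dynamic FP8 scaling.
            if key_scale is not None:
                key_scale[dst].copy_(key_scale[src])
            if value_scale is not None:
                value_scale[dst].copy_(value_scale[src])
```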
Force-pushed f3889f4 to f9b8994
Force-pushed 1c66b48 to 11a46aa
Force-pushed 38a7d10 to fb65d49
Force-pushed 7d7c612 to 03b767c
```diff
 return cache

-def fetch_from_cache(self, cache, blocks):
+def fetch_from_cache(self, cache, blocks, scales=None):
```
e2e tests fail with:
PatchedVLLMKVCache.fetch_from_cache() takes 3 positional arguments but 4 were given
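For context, this is a standard Python signature mismatch: the PR now passes the scales tensor positionally, while INC's PatchedVLLMKVCache wrapper still declares the old two-argument fetch_from_cache. A hypothetical stand-in reproduces the same TypeError (not the actual INC code):

```python
import torch


class OldStyleKVCacheWrapper:
    """Stand-in for a wrapper whose signature predates the scales argument."""

    def fetch_from_cache(self, cache, blocks):  # 3 positionals incl. self
        return cache[blocks]


cache = torch.zeros(4, 8)
blocks = torch.tensor([0, 2])
scales = torch.ones(4, 1)

wrapper = OldStyleKVCacheWrapper()
try:
    # Callers updated by this PR now pass scales as a third argument:
    wrapper.fetch_from_cache(cache, blocks, scales)
except TypeError as exc:
    print(exc)  # ... takes 3 positional arguments but 4 were given
```

Accepting an optional scales argument on the INC side (presumably what the linked INC PR does) would resolve the mismatch while keeping older callers working.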
Thanks @michalkuligowski, there is a related PR in INC to support this feature:
https://github.com/habana-internal/neural-compressor-fork/pull/322
I guess the INC PR needs to be merged first for this PR to pass the tests.
No description provided.