
Conversation


@bopeng1234 commented Apr 22, 2025

Description

Add 4-bit channel-wise quantization capability for the DequantizeLinear op for the Phi-3 model; it improves TPS on the Intel NPU.

JIRA - https://jira.devtools.intel.com/browse/EISW-163602

Motivation and Context

As Intel's NPU support for LLMs shows (https://github.com/openvinotoolkit/openvino.genai/tree/master/samples/python/text_generation#npu-support), if we want to run an ONNX quantized model such as Phi-3 on the Intel NPU, the quantized model needs to meet two requirements:

  1. symmetric quantization, zero point = 0
  2. channel-wise quantization, block_size = K

So this PR enables symmetric, channel-wise quantization, as sketched below.
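For reference, here is a minimal numpy sketch of what these two requirements mean for a (K, N) MatMul weight. The function and variable names are illustrative only and are not part of this PR's implementation:

```python
import numpy as np

def quantize_per_channel_symmetric_int4(W):
    # W has shape (K, N). One scale per output channel (block_size = K),
    # so every element of a column shares the same scale.
    max_abs = np.max(np.abs(W), axis=0)                  # shape (N,)
    scales = np.where(max_abs == 0, 1.0, max_abs / 7.0)  # INT4 symmetric range is [-8, 7]
    q = np.clip(np.round(W / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize_per_channel(q, scales):
    # What DequantizeLinear computes for this layout: W_hat = q * scale, zero_point = 0.
    return q.astype(np.float32) * scales

W = np.random.randn(3072, 3072).astype(np.float32)  # e.g. a Phi-3 MatMul weight
q, scales = quantize_per_channel_symmetric_int4(W)
W_hat = dequantize_per_channel(q, scales)
print("max abs reconstruction error:", np.max(np.abs(W - W_hat)))
```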

We tested it with onnxruntime-genai changes (we opened a PR to onnxruntime-genai as well to support the extra arguments, microsoft/onnxruntime-genai#1362) and with OpenVINO changes (openvinotoolkit/openvino#30265).

command:
python -m onnxruntime_genai.models.builder -o E:\download\onnx\Phi-3-mini-4k-instruct-onnx-channelwise-modified-QDQ-T -p int4 -e cpu -i E:\download\huggingface\Phi-3-mini-4k-instruct --extra_options use_channel_wised_quantization=1 use_qdq=1
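For anyone calling the quantizer directly rather than through the model builder, below is a rough sketch of the programmatic equivalent. channel_wised_quantize is the new argument this work adds to MatMulNBitsQuantizer (see the referenced commits), so treat its exact name, type, and defaults as assumptions:

```python
import onnx
from onnxruntime.quantization import QuantFormat
from onnxruntime.quantization.matmul_nbits_quantizer import MatMulNBitsQuantizer

model = onnx.load("phi-3-mini-4k-instruct.onnx")  # hypothetical input path

quantizer = MatMulNBitsQuantizer(
    model,
    block_size=32,                 # expected to be superseded when channel-wise is enabled (assumption)
    is_symmetric=True,             # zero point = 0, as the NPU requires
    quant_format=QuantFormat.QDQ,  # emit DequantizeLinear (QDQ) nodes rather than MatMulNBits
    channel_wised_quantize=True,   # new flag from this PR's companion changes (assumed usage)
)
quantizer.process()
quantizer.model.save_model_to_file(
    "phi-3-mini-4k-instruct-int4-cw.onnx", use_external_data_format=True
)
```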

Normally, without the channel-wise quantized model, Phi-3 with NPUW runs at about 4000 ms per token with the KV-cache model. With this PR applied, Phi-3 with NPUW runs at about 150 ms per token, a speedup of more than 20x (4000 / 150 ≈ 27x).

@bopeng1234 (Author) commented:

@ankitm3k, I created this new PR; it only adds the QDQ channel-wise (CW) changes and removes the QOperator-related code.

@ankitm3k commented:

@bopeng1234 kindly resolve conflicts

@bopeng1234 force-pushed the ovep-develop-dev branch 3 times, most recently from a435ea0 to 9194c4f on May 9, 2025, 04:40
@ankitm3k left a comment:


Reviewed & tested the changes, LGTM

@ankitm3k ankitm3k merged commit 8d2f3c4 into intel:ovep-develop May 14, 2025
3 of 5 checks passed
ankitm3k pushed a commit that referenced this pull request Jul 2, 2025
…669)

* add channel wise quantization option for QDQ, it optimize for intel NPU

* add channel_wised_quantize args to MatMulNBitsQuantizer
