
Conversation

@bopeng1234 commented Mar 31, 2025

Add extra options to builder.py

Enable quantizing the model with block size = K.

This PR works together with intel/onnxruntime#631 to enable channel-wise quantization in onnxruntime-genai, generating symmetric, block_size = -1 quantized models.

With models in this format, the Intel NPU runs 20x+ faster compared to the original block size 16/32/64/128/256 models.

command:

python -m onnxruntime_genai.models.builder -o E:\download\onnx\Phi-3-mini-4k-instruct-onnx-channelwise-modified -p int4 -e cpu -i E:\download\huggingface\Phi-3-mini-4k-instruct --extra_options use_channel_wised_quantization=1 use_qdq=1
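
For context, channel-wise quantization here means one scale per output channel, i.e. the quantization block spans the full K dimension rather than 16/32/64/128/256 elements. Below is a minimal numpy sketch of that idea; the function name and shapes are illustrative assumptions, not the builder's actual code:

```python
# Illustrative sketch only, not the builder's actual implementation.
# Channel-wise (block_size = K) symmetric int4 quantization: one scale per
# output channel instead of one scale per block of 16/32/64/... elements.
import numpy as np

def quantize_channel_wise_int4(W: np.ndarray):
    """W has shape [N, K]; each row (output channel) gets a single scale."""
    max_abs = np.max(np.abs(W), axis=1, keepdims=True)         # [N, 1]
    scales = max_abs / 7.0                                      # symmetric int4 range [-8, 7]
    scales[scales == 0] = 1.0                                   # guard against all-zero rows
    q = np.clip(np.round(W / scales), -8, 7).astype(np.int8)    # int4 values stored in int8
    return q, scales.squeeze(1)

W = np.random.randn(8, 64).astype(np.float32)   # N=8 output channels, K=64
q, scales = quantize_channel_wise_int4(W)
W_dequant = q.astype(np.float32) * scales[:, None]
print("max abs error:", float(np.max(np.abs(W - W_dequant))))
```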

@bopeng1234 (Author)

@microsoft-github-policy-service agree

@bopeng1234 marked this pull request as draft April 8, 2025 01:36
@bopeng1234 marked this pull request as ready for review April 22, 2025 02:24
@bopeng1234 force-pushed the main branch 2 times, most recently from 9ac5a6a to cc8b56a on May 7, 2025 01:07
@bopeng1234 (Author)

Hi, can we merge this PR now that intel/onnxruntime#669 has been merged?

@kunal-vaishnavi (Contributor)

I ran this PR's changes several months ago and remember encountering invalid model issues or runtime issues back then (see below for an example).

Prompt (Use quit() to exit): Hello

Thread 1 "python" received signal SIGFPE, Arithmetic exception.
0x0000700494703e49 in onnxruntime::common::Status onnxruntime::contrib::cuda::Dequantize4Bits<__half, unsigned char>(__half*, unsigned char const*, __half const*, unsigned char const*, int const*, int, int, int, CUstream_st*) () from /opt/conda/lib/python3.11/site-packages/onnxruntime/capi/libonnxruntime_providers_cuda.so
(gdb) bt
#0  0x0000700494703e49 in onnxruntime::common::Status onnxruntime::contrib::cuda::Dequantize4Bits<__half, unsigned char>(__half*, unsigned char const*, __half const*, unsigned char const*, int const*, int, int, int, CUstream_st*) ()
   from /opt/conda/lib/python3.11/site-packages/onnxruntime/capi/libonnxruntime_providers_cuda.so
#1  0x0000700494395c36 in onnxruntime::contrib::cuda::MatMulNBits<onnxruntime::MLFloat16>::ComputeInternal(onnxruntime::OpKernelContext*) const () from /opt/conda/lib/python3.11/site-packages/onnxruntime/capi/libonnxruntime_providers_cuda.so

Have those issues been resolved, and is this working now?

Use this option to enable GPUs that do not support FP16 on WebGPU (e.g. GTX 10xx).
adapter_path = Path to folder on disk containing the adapter files (adapter_config.json and adapter model weights).
Use this option for LoRA models.
use_channel_wised_quantization = Use channel wised quantization, in which block size = rows (K)
Contributor

After the int4_ prefix is inserted into the name, let's move this to be after int4_algo_config so that all of the int4 extra options are grouped together. That makes it easier for a user to see the int4 extra options in one block when running python builder.py --help.

Author

Sure, changed it to int4_use_channel_wised_quantization and moved its position to after int4_algo_config.

adapter_path = Path to folder on disk containing the adapter files (adapter_config.json and adapter model weights).
Use this option for LoRA models.
use_channel_wised_quantization = Use channel wised quantization, in which block size = rows (K)
Use this option when you want use K as block size, default is False
Contributor

Suggested change
Use this option when you want use K as block size, default is False
Use this option when you want use K as block size. Default is false.

Author

Changed, thanks

@bopeng1234 (Author)

> I ran this PR's changes several months ago and remember encountering invalid model issues or runtime issues back then (see the SIGFPE stack trace above for an example). Have those issues been resolved, and is this working now?

From the log, I don't see why this PR's changes would cause that runtime issue; the channel-wise option doesn't touch the Dequantize4Bits function.

BTW, I checked the CI failure log: this PR depends on ORT commit microsoft/onnxruntime@dfc27cd, which was merged three weeks ago, on 7 July. The CI doesn't use the latest ORT yet, so maybe keep this PR open until the CI picks up ORT 1.23?
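
For reference, a quick way to confirm which onnxruntime build an environment (for example, the CI image) actually has installed, assuming onnxruntime is importable in that Python environment:

```python
# Print the installed onnxruntime version and available execution providers;
# useful for checking whether the environment already includes the ORT build
# this PR depends on.
import onnxruntime as ort

print(ort.__version__)                 # e.g. "1.23.0" once the newer build lands
print(ort.get_available_providers())   # e.g. ["CPUExecutionProvider", ...]
```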
