
Conversation

@bopeng1234 commented Mar 31, 2025

Add extra options to builder.py

Enable quantizing the model with block size = K.

This PR works together with intel/onnxruntime#631 to enable channel-wise quantization in onnxruntime-genai, generating symmetric, block_size = -1 quantized models.

With models in this format, the Intel NPU runs 20x+ faster compared to the original block size 16/32/64/128/256 models.

command:

python -m onnxruntime_genai.models.builder -o E:\download\onnx\Phi-3-mini-4k-instruct-onnx-channelwise-modified -p int4 -e cpu -i E:\download\huggingface\Phi-3-mini-4k-instruct --extra_options use_channel_wised_quantization=1 use_qdq=1
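
For context, channel-wise quantization here means one scale per output channel, i.e. the quantization block spans the full K dimension rather than 16/32/64/128/256 elements. Below is a minimal numpy sketch of that idea; the function name and shapes are illustrative assumptions, not the builder's actual code:

```python
# Illustrative sketch only, not the builder's actual implementation.
# Channel-wise (block_size = K) symmetric int4 quantization: one scale per
# output channel instead of one scale per block of 16/32/64/... elements.
import numpy as np

def quantize_channel_wise_int4(W: np.ndarray):
    """W has shape [N, K]; each row (output channel) gets a single scale."""
    max_abs = np.max(np.abs(W), axis=1, keepdims=True)         # [N, 1]
    scales = max_abs / 7.0                                      # symmetric int4 range [-8, 7]
    scales[scales == 0] = 1.0                                   # guard against all-zero rows
    q = np.clip(np.round(W / scales), -8, 7).astype(np.int8)    # int4 values stored in int8
    return q, scales.squeeze(1)

W = np.random.randn(8, 64).astype(np.float32)   # N=8 output channels, K=64
q, scales = quantize_channel_wise_int4(W)
W_dequant = q.astype(np.float32) * scales[:, None]
print("max abs error:", float(np.max(np.abs(W - W_dequant))))
```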

@bopeng1234 (Author)

@microsoft-github-policy-service agree

@bopeng1234 marked this pull request as draft April 8, 2025 01:36
@bopeng1234 marked this pull request as ready for review April 22, 2025 02:24
@bopeng1234 force-pushed the main branch 2 times, most recently from 9ac5a6a to cc8b56a on May 7, 2025 01:07
@bopeng1234 (Author)

Hi, can we merge this PR now that intel/onnxruntime#669 has been merged?

@kunal-vaishnavi (Contributor)

I ran this PR's changes several months ago and remember encountering invalid model issues or runtime issues back then (see below for an example).

Prompt (Use quit() to exit): Hello

Thread 1 "python" received signal SIGFPE, Arithmetic exception.
0x0000700494703e49 in onnxruntime::common::Status onnxruntime::contrib::cuda::Dequantize4Bits<__half, unsigned char>(__half*, unsigned char const*, __half const*, unsigned char const*, int const*, int, int, int, CUstream_st*) () from /opt/conda/lib/python3.11/site-packages/onnxruntime/capi/libonnxruntime_providers_cuda.so
(gdb) bt
#0  0x0000700494703e49 in onnxruntime::common::Status onnxruntime::contrib::cuda::Dequantize4Bits<__half, unsigned char>(__half*, unsigned char const*, __half const*, unsigned char const*, int const*, int, int, int, CUstream_st*) ()
   from /opt/conda/lib/python3.11/site-packages/onnxruntime/capi/libonnxruntime_providers_cuda.so
#1  0x0000700494395c36 in onnxruntime::contrib::cuda::MatMulNBits<onnxruntime::MLFloat16>::ComputeInternal(onnxruntime::OpKernelContext*) const () from /opt/conda/lib/python3.11/site-packages/onnxruntime/capi/libonnxruntime_providers_cuda.so

Have those issues been resolved, and is this working now?

Use this option to enable GPUs that do not support FP16 on WebGPU (e.g. GTX 10xx).
adapter_path = Path to folder on disk containing the adapter files (adapter_config.json and adapter model weights).
Use this option for LoRA models.
use_channel_wised_quantization = Use channel wised quantization, in which block size = rows (K)
Contributor

After the int4_ prefix is inserted into the name, let's move this to be after int4_algo_config so that all of the int4 extra options are grouped together. That makes it easier for a user to see the int4 extra options in one block when running python builder.py --help.

Author

Sure, changed it to int4_use_channel_wised_quantization and moved its position to after int4_algo_config.

adapter_path = Path to folder on disk containing the adapter files (adapter_config.json and adapter model weights).
Use this option for LoRA models.
use_channel_wised_quantization = Use channel wised quantization, in which block size = rows (K)
Use this option when you want use K as block size, default is False
Contributor

Suggested change
Use this option when you want use K as block size, default is False
Use this option when you want use K as block size. Default is false.

Author

Changed, thanks

@bopeng1234 (Author)

> I ran this PR's changes several months ago and remember encountering invalid model issues or runtime issues back then (see the SIGFPE stack trace above for an example). Have those issues been resolved, and is this working now?

From the log, I don't see why this PR's changes would cause that runtime issue; the channel-wise option doesn't touch the Dequantize4Bits function.

BTW, I checked the CI failure log: this PR depends on ORT commit microsoft/onnxruntime@dfc27cd, which was merged three weeks ago, on 7 July. The CI doesn't use the latest ORT yet, so maybe keep this PR open until the CI picks up ORT 1.23?
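
For reference, a quick way to confirm which onnxruntime build an environment (for example, the CI image) actually has installed, assuming onnxruntime is importable in that Python environment:

```python
# Print the installed onnxruntime version and available execution providers;
# useful for checking whether the environment already includes the ORT build
# this PR depends on.
import onnxruntime as ort

print(ort.__version__)                 # e.g. "1.23.0" once the newer build lands
print(ort.get_available_providers())   # e.g. ["CPUExecutionProvider", ...]
```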
