add extra_options use_channel_wised_quantization to builder.py #1362
Conversation
@microsoft-github-policy-service agree
Force-pushed from 9ac5a6a to cc8b56a.
Hi, can we merge this PR since intel/onnxruntime#669 has been merged?
I ran this PR's changes several months ago and remember encountering invalid-model or runtime issues back then (see below for an example). Have those issues been resolved so that this works now?
src/python/py/models/builder.py
Outdated
    Use this option to enable GPUs that do not support FP16 on WebGPU (e.g. GTX 10xx).
adapter_path = Path to folder on disk containing the adapter files (adapter_config.json and adapter model weights).
    Use this option for LoRA models.
use_channel_wised_quantization = Use channel wised quantization, in which block size = rows (K)
After inserting the int4_ prefix into the name, let's move this to be after int4_algo_config so that all of the int4 extra options are grouped together. It makes it easier for a user to see the int4 extra options in one block when running python builder.py --help.
Sure, changed to int4_use_channel_wised_quantization and moved it to after int4_algo_config.
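As a side note, here is a minimal sketch of how such a renamed key=value flag could be read out of the parsed --extra_options dict; the helper function and dict layout are illustrative assumptions, not builder.py's actual internals:

```python
# Illustrative sketch only: --extra_options key=value pairs are assumed to
# arrive as a dict of strings, e.g. {"int4_use_channel_wised_quantization": "1"}.
def use_channel_wise(extra_options: dict) -> bool:
    raw = str(extra_options.get("int4_use_channel_wised_quantization", "0"))
    return raw.lower() in ("1", "true")
```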
src/python/py/models/builder.py
Outdated
adapter_path = Path to folder on disk containing the adapter files (adapter_config.json and adapter model weights).
    Use this option for LoRA models.
use_channel_wised_quantization = Use channel wised quantization, in which block size = rows (K)
    Use this option when you want use K as block size, default is False
Suggested change:
- Use this option when you want use K as block size, default is False
+ Use this option when you want to use K as block size. Default is false.
Changed, thanks
From the log, I don't know why this PR's changes would cause that runtime issue; the channel-wise argument wasn't touched. BTW, I checked the CI failure's log: the reason is that this PR depends on ORT microsoft/onnxruntime@dfc27cd, which was merged three weeks ago, on 7 July. The CI doesn't use the latest ORT, so maybe keep the PR open until the CI uses the 1.23 ORT?
Add extra options to builder.py
Enable quantizing the model with block size = K.
This PR is meant to work with intel/onnxruntime#631 to enable the channel-wise quantization capability of onnxruntime-genai, generating symmetric, block_size = -1 quantized models.
With a model in this format, the Intel NPU achieves a 20x+ speedup compared to the original block size 16/32/64/128/256 models.
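For intuition about what "block size = rows (K)" / block_size = -1 means, here is a minimal NumPy sketch (my illustration, not the builder's actual quantizer): each output channel gets a single symmetric int4 scale computed over the whole K dimension, instead of one scale per fixed-size block of 16/32/64/128/256 rows:

```python
import numpy as np

def quantize_channel_wise_int4(weight: np.ndarray):
    """Symmetric int4 quantization with block size = K: one scale per column.

    weight: (K, N) float matrix. Illustrative sketch, not the builder's code.
    """
    # One scale per output channel, spanning all K rows (block_size = K, i.e. -1).
    max_abs = np.abs(weight).max(axis=0)               # shape (N,)
    scales = np.where(max_abs == 0, 1.0, max_abs / 7)  # int4 range is [-8, 7]; 7 keeps it symmetric
    q = np.clip(np.round(weight / scales), -8, 7).astype(np.int8)
    return q, scales  # dequantize: weight ≈ q * scales

# Example: a single scale per column for a 3072 x 3072 weight.
w = np.random.randn(3072, 3072).astype(np.float32)
q, s = quantize_channel_wise_int4(w)
```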
command:
python -m onnxruntime_genai.models.builder -o E:\download\onnx\Phi-3-mini-4k-instruct-onnx-channelwise-modified -p int4 -e cpu -i E:\download\huggingface\Phi-3-mini-4k-instruct --extra_options use_channel_wised_quantization=1 use_qdq=1
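Per the review thread above, the option was later renamed, so with the final revision the flag would presumably be passed as int4_use_channel_wised_quantization=1 instead.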