Skip to content

schnell18/lm-quant-toolkit

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Overview

The lm-quant-toolkit is a suite of tools to facilitate large neural network quantization research. It includes a quantization harness tool to drive quantization experiments on large language models and vision models, to collect and summarize experiment data for further analysis. It also includes tool to prepare experiment meta data and visualization tools to interpret experiment results. Specifically, lm-quant-toolkit consists of:

  • LLM quantization harness tool
  • ViT quantization harness tool
  • FNorm Metadata Preparation Tool
  • Kurtosis Metrics Measuring Tool
  • Sensitivity Score Measuring Tool
  • Calibration Dataset Generation Tool
  • Visualization Tools

Citation

@inproceedings{zhang2025mxq,
  title = {A Mixed Quantization Approach for Data-Free Quantization of LLMs},
  author = {Feng Zhang and Yanbin Liu and Weihua Li and Xiaodan Wang and Quan Bai},
  year = {2025},
  url = {https://openreview.net/forum?id=M3Y74vmsMcY},
}

Setup test harness

Most tools are implemented in Python and are extensively tested under the Python 3.11.9. The visualization tools are implemented in R. The usages of these tools are elaborated in the following sections. This section describes how to setup the lm-quant-toolkit and the companion visualization tools.

The Python tools dependend on Python libraries such as transformers, datasets, numpy, PyTorch etc. A few Python libraries are patched to support MXQ. Specifically, required patched dependencies include AutoGPTQ (for CUDA 12.5 compatibility), HQQ (support MXQ extension), lm_eval (for end-to-end LLM performance evaluation), clip_benchmark (for vision model evaluation). These dependencies are installed automatically as part of setup process. To setup the Python tools, follow this procedure:

  • Ensure Python and miniconda are installed
  • Create a Python virtual enivonrment using Python 3.11.9 and activate this enivonrment
  • Clone the lm-quant-toolkit project from the lm-quant-toolkit project
  • Run the script setup-harness.sh under the root directory of the lm-quant-toolkit project

Or simply use the convenient script setup-harness.sh included this project.

Setup visualization tools

The visualization tools are R scripts to transform, aggregate and visualize experiment results. They are wrapped in bash scripts to automate the whole experiment loop, which consists of model quantization, perplex evaluation, memory consumption test and experiment report generation. The R visualization scripts can also be used separately. To setup the visualization tools, please follow this procedure:

  • Ensure a recent version of R, for instance R 4.4.1, is installed.
  • Optionally, RStudio could be installed to extend and trouble shoot the visualization tools in an intuitive enivonrment.
  • Install the third-party packages required by the visualization tools by running the script setup-visualization.sh under the root directory of the lm-quant-toolkit project.

Quantization tool usage

LLM Quantization Harness Tool

This tool executes various quantization tasks and runs diverse evaluation benchmarks such as perplexity, GPU memory usage, quantized model storage. It also supports end-to-end LLM performance evaluation through the integration with the lm-eval tool. This harness tool works with various state-of-art quantization methods such as GPTQ, AWQ, BitsAndBytes and HQQ, which enables a fair comparison between the proposed methods and the state-of-art baselines. Furthermore, it facilitates the complex and time-consuming benchmarking tasks by offering resumption from failed subtasks, aggregate subtask's evaluation results. Lastly, this tool provides declarative CLI interface to ease complex experiment automation through shell scripting.

FNorm Metadata Preparation Tool

This tool calculates the Frobenius norms, a.k.a FNorm, of the quantization errors of all weight matricies inside a particular large language model. The FNorm meta-data are crucial to the MXQ quantization scheme as it guides MXQ to allocate optimal quantization configurations. This tool accepts a list of Hugging Face-compliant model identifiers. The output of this tool is a series of .csv files under specified directory. Each file contains the Frobenius norms for the 12 quantization configurations.

The tool is implemented in Python and provides a convenient CLI interface to enable shell scripting. It is located separately in the dump.py file under the src folder in the lm-quant-toolkit project, which helps to reduce unnecessary dependencies. A typical usage is demonstrated in the code snippet as follows:

#!/bin/bash

TOOLKIT_DIR="$HOME/work/lm-quant-toolkit"
MODELS="meta-llama/Llama-2-7b-hf meta-llama/Llama-2-13b-hf meta-llama/Llama-3.1-8B"
mkdir -p /tmp/fnorm-dump
python $TOOLKIT_DIR/src/dump.py fnorm \
    --model $MODELS \
    --output-dir /tmp/fnorm-dump

Kurtosis Metrics Measuring Tool

This tool calculates the Kurtosis metrics of weight matricies layer-by-layer inside a particular large language model. The Kurtosis metrcis are crucial to identify sensitive layers to improve the accuracy of MXQ quantization. This tool accepts a list of Hugging Face-compliant model identifiers. The output of this tool is a series of .csv files under specified directory. Each file contains the Kurtosis metrics for corresponding models.

The tool is implemented in Python and provides a convenient CLI interface to enable shell scripting. It is included in the dump.py file under the src folder in the lm-quant-toolkit project. A typical usage is demonstrated in the code snippet as follows:

#!/bin/bash

MODELS="meta-llama/Llama-2-7b-hf meta-llama/Llama-2-13b-hf meta-llama/Meta-Llama-3-8B"
mkdir -p /tmp/kurtosis-dump
python ../src/dump.py kurtosis \
    --model $MODELS \
    --output-dir /tmp/kurtosis-dump

This code snippet demonstrates dumping the kurtosis metrics for the three Llama models into the /tmp/kurtosis-dump directory.

Sensitivity Score Measuring Tool

This tool calculates the sensitivity score of each layer of a particular large language model. The sensitivity score are crucial to identify sensitive layers to improve the accuracy of MXQ quantization. This tool accepts a list of Hugging Face-compliant model identifiers. The output of this tool is a series of .csv files, each contains the sensitivity score for corresponding model. These files are crucial inputs to guide the SensiBoost and Sensitivity-based MiLP.

The tool is implemented in Python and provides a convenient CLI interface to enable shell scripting. It is compatible with any transformer-based LLMs with an implementation of the popular Hugging Face transformers library. It is located separately in the dump.py file under the src folder in the lm-quant-toolkit project, which helps to reduce unnecessary dependencies. A typical usage is demonstrated in the code snippet as follows:

#!/bin/bash

TOOLKIT_DIR="$HOME/work/lm-quant-toolkit"
RESULT_BASE_DIR="/data/llm/mxq/results"
CALIB_DATASETS="bos pileval wikitext c4"
CONFIGS="b2g128 b2g64 b2g32 b3g128 b3g64 b3g32 b4g128 b4g64 b4g32 b8g128 b8g64 b8g32"
MODELS="Qwen/Qwen2.5-7B Qwen/Qwen2.5-Coder-7B Qwen/Qwen2.5-Coder-7B-Instruct Qwen/Qwen2.5-Math-7B"

EXP_NAME=sensi_qwen25
RESULT_DIR=$RESULT_BASE_DIR/$EXP_NAME
mkdir -p $RESULT_DIR/data

for DS in $CALIB_DATASETS; do
    for CFG in $CONFIGS; do
        for MODEL in $MODELS; do
            SHORT_ID=$(echo $MODEL | cut -d/ -f2)
            OUT_FILE="${RESULT_DIR}/data/qwen25-sensi-${SHORT_ID}-${CFG}-${DS}.csv"
            python $TOOLKIT_DIR/src/dump.py sensi \
                --model $MODEL \
                --config $CFG \
                --calib-dataset $DS \
                --output-file $OUT_FILE
        done
    done
done

The code snippet demonstrates how to calculate the sensitivity scores for a series of Qwen2.5 models using 4 calibration datasets under 12 bit budgets.

Calibration Dataset Generation Tool

This tool generates a small synthensized dataset named branch of science (denoted as BoS, published on Hugging Face), which includes a few hundred of textual defintions for science, art and business topics such as Mathematics, Physics, Chemstry, Law, Music and Journalism etc. The dataset is intended to validate if the sensitivity property generalize to diverse datasets.

The tool generates an initial dataset in .csv format which requires further processing. The output of this tool is random due to the generative nature of LLM. This tool requires a Llama-2-7B model being served with an OpenAI compatible RESTful API endpoint. User can either use a hosted API endpoint or deploy a local instance by following the instruction at the end of this section.

Once the API endpoint is secured, run the following script to generate the BoS dataset:

#!/bin/bash

TOOLKIT_DIR="$HOME/work/lm-quant-toolkit"

$TOOLKIT_DIR/utils/generate.py \
  --model="meta-llama/Llama-2-7b-chat-hf" \
  --variant="vLLM" \
  --topic-file=topics-l1.txt \
  --trace

Lastly, find the result in the csv files under current directory.

Local API endpoint

To deploy a local API endpoint using vLLM, create a virtual environment using conda as follows:

conda create -n vllm python=3.11 -y
conda activate vllm
pip install vllm==0.6.4.post1

Then configure and launch the API server

#!/bin/bash

vllm serve meta-llama/Llama-2-7b-chat-hf --dtype auto --api-key token-abc123

Watch the output vLLm to make sure it starts up successfully.

Visualization Tool usage

The visualization tools facilitate visualizing the experiment results and the weight distribution, and generating insights of the latent features to quantize LLMs more efficiently. Most visualization tools are implemented in R and leverages the open-source plot libraries such as ggplot2, circlize, ggbreak, ggmagnify. They provide CLI interface to simplify integaration with the quantization harness tool.

These CLI tools support diverse options to allow user specify input dataset, select particular model or approach to plot. To get help on these specific CLI options, type ./plot_xxx.R --help on command line prompt. For instance, to get help on the MXQ allocation visualization tool, you may run command as follows:

./plot-mxq-allocation.R --help
Usage: ./plot-mxq-allocation.R [options]

Options:
        -h, --help
                Show this help message and exit

        -m CHARACTER, --model=CHARACTER
                Model ID

        -b DOUBLE, --budget=DOUBLE
                Bit Budget

        -d CHARACTER, --baseline_data_dir=CHARACTER
                Data directory of baseline results

        -q CHARACTER, --quant_cfg_allot_file=CHARACTER
                The combined quant config allocation csv file

        --attempt1=CHARACTER
                The first attempt to plot

        --attempt2=CHARACTER
                The second attempt to plot

        --fnorm
                Display FNorm value in the bar chart

Weight Distribution Visualization Tool

This tool enables visualizing layer-wised weight distribution of large language models. It is implemented as an R script, which provides a convenient CLI interface to enable shell scripting. Given a weight distribution metrics csv file, it produces a pdf file under the pdfs with 3x3 sub-plots of column digrams for the 9 modules in the Llama family models.

The tool is named plot-wdist-llm.R and located under the data-vis folder in the lm-quant-toolkit project. A typical usage is demonstrated in the code snippet as follows:

#!/bin/bash

TOOLKIT_DIR="$HOME/work/lm-quant-toolkit"

$TOOLKIT_DIR/data-vis/plot-wdist-llm.R -m Llama-2-7b-hf

Perplexity vs Bit Budget Visualization Tool

This tool enables visualizing the relationship between perplexity and bit budget for diverse MXQ experiments against their baselines. The generated diagram shows how memory reduction affects perplexity, which facilitates memory-accuracy trade-off.

The tool is named plot-ppl-mem.R and located under the data-vis folder in the lm-quant-toolkit project. It accepts a csv file containing the perplexity metrics of MXQ and its baselines. The output are series of PDF files corresponding to the models defined in the input file, which are placed under the pdfs subfolder. A typical usage is demonstrated in the code snippet as follows:

#!/bin/bash

TOOLKIT_DIR="$HOME/work/lm-quant-toolkit"

$TOOLKIT_DIR/data-vis/plot-ppl-mem.R -d data/combined.csv

Quantization Speed Comparison Visualization Tool

This tool generates column digrams to explore the quantization speed among various approaches. The tool is also implemented as an R script, which provides a convenient CLI interface to enable shell scripting. The tool is named plot-quant-speed.R and located under the data-vis folder in the lm-quant-toolkit project. Given a combined perplexity metrics csv file, it produces a column digrams with x-axis in log-scale. Similar to other tools, the PDF file is placed under the pdfs subfolder. A typical usage is demonstrated in the code snippet as follows:

#!/bin/bash

TOOLKIT_DIR="$HOME/work/lm-quant-toolkit"

$TOOLKIT_DIR/data-vis/plot-quant-speed.R -d data/combined.csv

GPU Memory Usage Visualization Tool

This tool generates column digrams to present the actual GPU memory consumption of LLMs quantized by diverse methods. The tool is also implemented as an R script, which provides a convenient CLI interface to enable shell scripting. The tool is named plot-mem-consumption.R and located under the data-vis folder in the lm-quant-toolkit project. Given a combined perplexity metrics csv file, it produces a column digrams of GPU memory usage in Giga-byte. Similar to other tools, the PDF file is placed under the pdfs subfolder. A typical usage is demonstrated in the code snippet as follows:

#!/bin/bash

TOOLKIT_DIR="$HOME/work/lm-quant-toolkit"

$TOOLKIT_DIR/data-vis/plot-mem-consumption.R -d data/combined.csv

Quantization Configuration Allocation Visualization Tool

This tool offers insights into the way MXQ and its variants allocate bit budget to modules and layers. The variants, a.k.a. attempt, to include in the plot are configurable. A maximium of 4 variants can be plotted in a circular layout thanks to plot library circlize \citep{zuguang_2014}. The first input expected by the tool is a combined quantization configuration allocation csv file which should include experiment outcome for diverse methods such as HQQ and MXQ. The second parameter is the directory where Frobenius norms csv files are located. The third parameter is the perplexity score csv file. The tool produces circos digram in PDF format.

The tool is named plot-circos-allot.R and located under the data-vis folder in the lm-quant-toolkit project. A typical usage is demonstrated in the code snippet as follows:

#!/bin/bash

TOOLKIT_DIR="../../.."

MODELS="
Llama-3-7b-hf
Llama-3-13b-hf
Meta-Llama-3-8B
"
BUDGETS="4.25 3.51"

STOP=2
TOPM=2
for MODEL in $MODELS; do
    for BUDGET in $BUDGETS; do
        $TOOLKIT_DIR/data-vis/plot-circos-allot.R \
          --model $MODEL \
          --budget $BUDGET \
          --fnorm_data_dir $TOOLKIT_DIR/src/data/ \
          --ppl_csv_file data/combined.csv \
          --quant_cfg_allot_file data/quant-cfg-allocation.csv \
          --attempt1 sensi-boost-${STOP}-${TOPM} \
          --attempt2 kurt-boost-${STOP}-${TOPM} \
          --attempt3 hqq\
          --attempt4 mxq1
    done
done

This code snippet demonstrates how to generate a quant config allocation comparison diagram to examine the nuanced difference between the SensiBoost and kurtBoost approaches, with a stop of 2 and top-{m} 2, as well as the HQQ and MXQ baselines.

SensiBoost/KurtBoost Win-Tie-Loss Visualization Tool

This tool enables qualitative analysis of effectiveness of the proposed SensiBoost and KurtBoost methods. It is implemented as an R script, which provides a conventional CLI interface to ease automation. Given a combined perplexity metrics csv file, it produces a series of column digrams in PDF format. The csv file should include experiment outcome for SensiBoost, KurtBoost, the ablation tests or baseline such as HQQ and MXQ. The name experiment, a.k.a. attempt, should follow the pattern <method>-<stop>-<top m>.

This tool is included in the lm-quant-toolkit under data-vis folder. A typical usage is demonstrated in the code snippet as follows:

#!/bin/bash

TOOLKIT_DIR="$HOME/work/lm-quant-toolkit"

$TOOLKIT_DIR/data-vis/plot-win-tie-loss.R -f data/combined.csv

About

LLM Quantization toolkit

Resources

License

Stars

Watchers

Forks

Packages

No packages published