
Quantization

Overview

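Quantization shrinks a model and raises inference throughput by lowering the numerical precision of its weights and/or activations. The core trade-off is accuracy loss versus performance and memory gains.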

Contents

Topic	Description	Status
Weight-Only Quantization	GPTQ, AWQ, SpQR: quantize weights only, activations stay in FP16	Outline drafted
Weight + Activation Quantization	SmoothQuant, LLM.int8(): W8A8, quantizing weights and activations together	Outline drafted
FP8 Quantization	E4M3/E5M2 PTQ, per-tensor/per-channel scaling	→ see 04-GEMM/FP8
QAT (Quantization-Aware Training)	LLM-QAT, data-free distillation	Outline drafted
NV Ammo / TensorRT Model Optimizer	NVIDIA quantization toolchain	Available
PEFT & LoRA	→ see 12-rl-and-alignment	Cross-reference

Weight-Only Quantization

TODO: expand each method into its own analysis page (principle, implementation, accuracy impact, applicable scenarios)

AWQ (Activation-Aware Weight Quantization)

Identifies the ~1% of salient weights by activation magnitude and protects them through per-channel scaling
W4A16: suited to memory-bound inference (small batch, low latency)
Tools: llm-awq, AutoAWQ
Integrations: TRT-LLM, vLLM, Intel Neural Compressor
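
The per-channel scaling idea can be sketched in NumPy as follows. This is a minimal illustration, not the llm-awq or AutoAWQ implementation: the function names `quantize_int4` and `awq_like_search` are hypothetical, quantization is per output channel rather than group-wise, and the scale is assumed to take the form s = mean|X|^alpha with alpha chosen by grid search on the layer output error.

```python
import numpy as np

def quantize_int4(W):
    """Symmetric round-to-nearest INT4 per output channel, returned dequantized.
    W has shape (in_features, out_features)."""
    scale = np.abs(W).max(axis=0, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(W / scale), -8, 7) * scale

def awq_like_search(X, W, n_grid=20):
    """Grid-search alpha for the per-input-channel scale s = mean|X|^alpha."""
    act_mag = np.clip(np.abs(X).mean(axis=0), 1e-4, None)  # (in_features,)
    ref = X @ W
    best_err, best_alpha, best_Wq = np.inf, None, None
    for alpha in np.linspace(0.0, 1.0, n_grid + 1):
        s = act_mag ** alpha  # alpha = 0 gives s = 1, i.e. plain round-to-nearest
        # Scale salient input channels up before quantizing, then divide the
        # scale back out. This equals (X / s) @ Q(W * s), where a real kernel
        # would fold 1/s into the preceding operator instead.
        Wq = quantize_int4(W * s[:, None]) / s[:, None]
        err = float(np.mean((X @ Wq - ref) ** 2))
        if err < best_err:
            best_err, best_alpha, best_Wq = err, alpha, Wq
    return best_alpha, best_Wq, best_err

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 64))
X[:, :2] *= 30.0  # two salient input channels (synthetic data)
W = rng.normal(size=(64, 32))
alpha, Wq, awq_err = awq_like_search(X, W)
rtn_err = float(np.mean((X @ quantize_int4(W) - X @ W) ** 2))
print(f"alpha={alpha:.2f}  rtn_mse={rtn_err:.4f}  awq_mse={awq_err:.4f}")
```

Because alpha = 0 is part of the grid, the searched result can never be worse than plain round-to-nearest on the calibration data; how much it improves depends on how concentrated the activation magnitudes are.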

GPTQ

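W4A16 mixed-precision quantization: INT4 weights, FP16 activations, with weights dequantized on the fly at inference time
Quantizes weights one at a time within blocks and compensates the resulting error in the weights not yet quantized; requires a calibration dataset
Tools: AutoGPTQ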
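
The quantize-then-compensate loop can be sketched in NumPy as below. This is a simplified illustration under stated assumptions, not AutoGPTQ's code: `gptq_like` and `quantize_rtn` are hypothetical names, the scale is a single symmetric INT4 scale per output row fixed from the original weights, and the block-wise batching of updates is omitted so that columns are processed strictly one by one.

```python
import numpy as np

def quantize_rtn(w, scale, qmax=7):
    """Round-to-nearest onto a symmetric INT4 grid, returned dequantized."""
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def gptq_like(W, X, damp=0.01, qmax=7):
    """W: (out, in) weights. X: (n_samples, in) calibration activations."""
    W = W.astype(np.float64).copy()
    n_in = W.shape[1]
    # One scale per output row, fixed from the original weights (assumption).
    scale = np.abs(W).max(axis=1) / qmax
    # Hessian of the layer-wise squared error, damped for numerical stability.
    H = 2.0 * X.T @ X
    H += damp * np.mean(np.diag(H)) * np.eye(n_in)
    # Upper Cholesky factor U of the inverse Hessian (U.T @ U == inv(H)).
    U = np.linalg.cholesky(np.linalg.inv(H)).T
    Q = np.zeros_like(W)
    for i in range(n_in):
        q = quantize_rtn(W[:, i], scale, qmax)
        Q[:, i] = q
        # Spread the quantization error of column i over the remaining columns.
        err = (W[:, i] - q) / U[i, i]
        W[:, i + 1:] -= np.outer(err, U[i, i + 1:])
    return Q, scale

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
X = rng.normal(size=(64, 16))
Q, scale = gptq_like(W, X)
print("layer output MSE:", np.mean((X @ W.T - X @ Q.T) ** 2))
```

The calibration set enters only through the Hessian, which is why the method needs representative activations but no labels or gradient steps.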

LLM.int8()

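Mixed-precision matrix decomposition: the vast majority of values are quantized to INT8, while a small number of outlier dimensions stay in FP16
Models above 6.7B parameters exhibit large numbers of outlier features, which are the main cause of accuracy degradation under INT8 quantization
Tools: bitsandbytes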
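
A NumPy sketch of the decomposition follows. It is illustrative only and does not use the bitsandbytes API: `int8_absmax` and `llm_int8_matmul` are hypothetical helpers, the outlier threshold of 6.0 is an assumed default, and the data is synthetic with one injected outlier dimension.

```python
import numpy as np

def int8_absmax(a, axis):
    """Symmetric absmax INT8 quantization along `axis`; returns (codes, scale)."""
    scale = np.abs(a).max(axis=axis, keepdims=True, initial=0.0) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    codes = np.clip(np.round(a / scale), -127, 127).astype(np.int8)
    return codes, scale

def llm_int8_matmul(X, W, threshold=6.0):
    """X: (tokens, hidden), W: (hidden, out)."""
    # Hidden dimensions in which any activation exceeds the threshold.
    outlier = (np.abs(X) > threshold).any(axis=0)
    # Outlier dimensions: small FP16 matmul.
    hi = X[:, outlier].astype(np.float16) @ W[outlier, :].astype(np.float16)
    # Remaining dimensions: INT8 matmul with per-row and per-column scales.
    Xq, sx = int8_absmax(X[:, ~outlier], axis=1)
    Wq, sw = int8_absmax(W[~outlier, :], axis=0)
    lo = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * sx * sw
    return lo + hi.astype(np.float32)

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 64))
X[:, 3] *= 20.0  # one synthetic outlier feature dimension
W = rng.normal(size=(64, 8))
ref = X @ W
mixed_err = np.abs(llm_int8_matmul(X, W) - ref).mean()
# threshold=inf disables the FP16 path, giving plain INT8 for comparison.
naive_err = np.abs(llm_int8_matmul(X, W, threshold=np.inf) - ref).mean()
print(f"mixed={mixed_err:.4f}  int8_only={naive_err:.4f}")
```

In the INT8-only path the outlier dimension inflates every row scale, so the ordinary values lose resolution; routing that dimension through FP16 keeps the INT8 scales tight.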

Weight + Activation Quantization

TODO: analyze the SmoothQuant implementation and kernel design in depth

SmoothQuant

W8A8: per-channel smoothing migrates activation outliers into the weights
Outperforms AWQ in compute-bound scenarios (large batch, high throughput)
Integrations: TRT-LLM, Intel Neural Compressor, AMD CK
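
The smoothing step can be sketched as below, assuming the usual per-input-channel factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) with alpha = 0.5. The function name `smooth` and the synthetic data are illustrative; a real deployment would fold 1/s into the preceding LayerNorm or linear layer rather than divide activations at runtime.

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """X: (tokens, in), W: (in, out). Returns smoothed (X', W', s)."""
    act_max = np.clip(np.abs(X).max(axis=0), 1e-5, None)  # per input channel
    w_max = np.clip(np.abs(W).max(axis=1), 1e-5, None)    # per input channel
    s = act_max ** alpha / w_max ** (1 - alpha)
    # X' @ W' == X @ W, since the scale cancels channel by channel.
    return X / s, W * s[:, None], s

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 64))
X[:, 5] *= 50.0  # one synthetic activation outlier channel
W = rng.normal(size=(64, 32))
Xs, Ws, s = smooth(X, W)
print("max |X| before/after:", np.abs(X).max(), np.abs(Xs).max())
print("max |W| before/after:", np.abs(W).max(), np.abs(Ws).max())
```

With alpha = 0.5 each channel ends up with the same maximum magnitude on the activation side and the weight side, so the quantization difficulty is split evenly between the two operands.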

SmoothQuant vs AWQ

	SmoothQuant (W8A8)	AWQ (W4A16)
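Best fit	Compute-bound (large batch)	Memory-bound (small batch)
Accuracy impact	Small	Moderate
Inference speedup	Significant in compute-intensive scenarios	Significant in bandwidth-limited scenarios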
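
A rough arithmetic-intensity estimate helps explain the split in the table. The sketch below counts FLOPs per byte moved for a single linear layer; the 4096 x 4096 layer size and the helper name `gemm_intensity` are assumptions for illustration, and activations are kept at 16 bits throughout.

```python
def gemm_intensity(batch, k, n, weight_bits, act_bits=16):
    """FLOPs per byte moved for a (batch, k) @ (k, n) linear layer."""
    flops = 2 * batch * k * n
    weight_bytes = k * n * weight_bits / 8
    act_bytes = (batch * k + batch * n) * act_bits / 8
    return flops / (weight_bytes + act_bytes)

# Hypothetical 4096 x 4096 layer; columns are 16-, 8-, and 4-bit weights.
for batch in (1, 16, 256, 4096):
    row = [gemm_intensity(batch, 4096, 4096, bits) for bits in (16, 8, 4)]
    print(batch, [round(v, 1) for v in row])
```

At batch 1 the intensity is about one FLOP per byte, so runtime is dominated by reading weights and shrinking them to 4 bits pays off directly. At large batch the intensity is high enough that arithmetic throughput becomes the limit, which is where running the matmul itself in INT8 tends to matter more.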

FP8 Quantization

See 04-gemm-and-precision/FP8 for details:

E4M3: preferred for inference (higher precision)
E5M2: preferred for the training backward pass (larger dynamic range)
Per-tensor / per-channel / per-tile scaling strategies
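
The precision versus range trade-off and the effect of scaling granularity can be made concrete with a little arithmetic. The sketch below assumes the common FP8 conventions in which E5M2 reserves its top exponent for inf/NaN while E4M3 reserves only a single NaN pattern; the helper names and the tiny weight matrix are illustrative.

```python
def fp8_max(exp_bits, man_bits, reserve_top_exponent):
    """Largest finite value of a small float format."""
    bias = 2 ** (exp_bits - 1) - 1
    if reserve_top_exponent:
        # IEEE-like (E5M2): the all-ones exponent encodes inf/NaN.
        max_exp = (2 ** exp_bits - 2) - bias
        max_frac = 2 - 2 ** -man_bits
    else:
        # E4M3: only the all-ones mantissa at the top exponent is NaN.
        max_exp = (2 ** exp_bits - 1) - bias
        max_frac = 2 - 2 ** -(man_bits - 1)
    return max_frac * 2 ** max_exp

E4M3_MAX = fp8_max(4, 3, reserve_top_exponent=False)
E5M2_MAX = fp8_max(5, 2, reserve_top_exponent=True)
print(E4M3_MAX, E5M2_MAX)

def per_tensor_scale(mat, fmt_max):
    return max(abs(v) for row in mat for v in row) / fmt_max

def per_channel_scales(mat, fmt_max):
    return [max(abs(v) for v in row) / fmt_max for row in mat]

# Two output channels with very different magnitudes (made-up values).
weights = [[0.02, -0.01, 0.03], [4.0, -8.0, 2.0]]
print(per_tensor_scale(weights, E4M3_MAX))
print(per_channel_scales(weights, E4M3_MAX))
```

E4M3 has three mantissa bits, giving a relative step of 2^-3, against 2^-2 for E5M2, while E5M2 covers a far wider range. With a per-tensor scale the small-magnitude row is squeezed into the bottom of the representable range; per-channel scales give each row its own full range at the cost of storing more scale factors.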

QAT

TODO: detailed analysis of LLM-QAT and the data-free distillation mechanism

LLM-QAT

Quantizes weights, activations, and the KV cache together
Performs knowledge distillation on data generated by the model itself (data-free)
KV cache quantization is critical for long-sequence throughput
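
To see why the KV cache matters at long sequence lengths, its size can be estimated directly. The configuration below (32 layers, 32 KV heads, head dimension 128) is a hypothetical 7B-class setup chosen for illustration, and the estimate ignores the small overhead of storing quantization scales.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bits):
    """Bytes needed to hold keys and values for every layer and token."""
    # The factor 2 accounts for one key tensor plus one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bits // 8

cfg = dict(layers=32, kv_heads=32, head_dim=128)  # hypothetical model config
for bits in (16, 8, 4):
    gib = kv_cache_bytes(seq_len=4096, batch=1, bits=bits, **cfg) / 2**30
    print(f"{bits}-bit KV cache: {gib:.2f} GiB")
```

The footprint grows linearly with both sequence length and batch size, so halving or quartering the bits per element translates directly into more concurrent sequences or longer contexts within the same memory budget.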

Reference