Model Quantization

2025-04-19

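Quantization is a technique for shrinking a model and speeding up inference. It works mainly by converting model weights (and often activations) from a higher-precision data type such as FP32 to a lower-precision one such as FP16 or INT8.
There are also more aggressive schemes, such as quantizing weights to INT4 or even binarizing them; BitNet takes this to the extreme.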

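Note that quantizing to INT8 is not the same as quantizing to FP8: INT8 is an integer format and FP8 is a floating-point format, so they differ in representable range and precision. INT8 values are uniformly spaced, whereas FP8 spends some bits on an exponent and so covers a wider dynamic range with non-uniform spacing. Integer arithmetic is generally cheaper and more power-efficient than floating-point arithmetic, so INT8 often delivers better throughput, although this depends on the hardware, and its dynamic range is narrower.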

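Quantizing a model has both advantages and disadvantages.

Advantages

Smaller model: a quantized model takes less storage and memory, so it loads faster and leaves more memory free.
Faster inference: low-precision weights need less memory bandwidth, and hardware with dedicated low-precision units (for example INT8 matrix multiplication on GPUs) can run the matrix multiplications faster.
Lower power consumption: a quantized model can run on lower-power hardware such as mobile devices, where it can run longer on the same battery.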
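
To make the size reduction concrete, the following sketch estimates the weight-only memory footprint at several precisions. The 7B parameter count and the helper name `weight_footprint_gib` are illustrative assumptions, and the estimate ignores the small overhead of storing scales and zero-points as well as activation memory.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_footprint_gib(num_params: int) -> dict:
    """Return the weight-only memory footprint in GiB for each precision."""
    return {name: num_params * nbytes / 2**30
            for name, nbytes in BYTES_PER_PARAM.items()}

# Hypothetical 7B-parameter model, chosen only for illustration.
footprint = weight_footprint_gib(7_000_000_000)
for name, gib in footprint.items():
    print(f"{name}: {gib:.2f} GiB")
```

Going from FP32 to INT8 cuts the weight storage by a factor of four, and INT4 halves it again.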

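Disadvantages

Accuracy loss: a quantized model can lose some accuracy. The loss is usually small at INT8 but can become significant at INT4 and below.
Added complexity: quantization introduces extra computation, such as the quantize and dequantize steps, which makes the model and the inference pipeline more complex.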
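
The accuracy trade-off can be illustrated by measuring the round-trip error of a simple symmetric quantizer at different bit widths. This is a minimal sketch on synthetic Gaussian weights; the function name `symmetric_quant_error` is an assumption made for this example, and real models will behave differently layer by layer.

```python
import numpy as np

def symmetric_quant_error(x: np.ndarray, bits: int) -> float:
    """Mean absolute round-trip error of symmetric per-tensor quantization."""
    q_max = 2 ** (bits - 1) - 1          # 127 for 8 bits, 7 for 4 bits
    scale = np.max(np.abs(x)) / q_max
    q = np.clip(np.round(x / scale), -q_max, q_max)
    return float(np.mean(np.abs(x - q * scale)))

rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, size=10_000).astype(np.float32)  # synthetic weights

err_int8 = symmetric_quant_error(w, bits=8)
err_int4 = symmetric_quant_error(w, bits=4)
print(f"INT8 mean abs error: {err_int8:.5f}")
print(f"INT4 mean abs error: {err_int4:.5f}")
```

With fewer bits the quantization step grows, so the INT4 error is noticeably larger than the INT8 error on the same data.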

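Quantization Methods

Symmetric Quantization

Compute the scale: $$scale = \frac{max(abs(x))}{127}$$

Quantize: $$x_{quant} = clip(round(x / scale), -127, 127)$$

Dequantize: $$x_{dequant} = x_{quant} * scale$$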

import numpy as np

# Original float vector
x = np.array([ -1.23, 0.0, 0.75, 2.5, -0.5 ], dtype=np.float32)

# 1. Compute the scale
scale_int8 = np.max(np.abs(x)) / 127.0

# 2. Quantize to int8
q_int8 = np.round(x / scale_int8)
q_int8 = np.clip(q_int8, -127, 127).astype(np.int8)

# 3. Dequantize back to float
x_dequant_int8 = q_int8.astype(np.float32) * scale_int8

print("original x:       ", x)
print("quantized q_int8: ", q_int8)
print("dequantized x̂:   ", x_dequant_int8)

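Asymmetric Quantization

Compute scale and zero_point. The original tensor range is [x_min, x_max] and the quantized tensor range is [q_min, q_max]: $$scale = \frac{x_{max} - x_{min}}{q_{max} - q_{min}}$$ and $$zero_{point} = round(q_{min} - \frac{x_{min}}{scale})$$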

import numpy as np

# Original float vector
x = np.array([-1.23, 0.0, 0.75, 2.5, -0.5], dtype=np.float32)

# Define the quantized range
q_min, q_max = -128, 127

# 1. Compute scale and zero_point
x_min, x_max = x.min(), x.max()
scale = (x_max - x_min) / (q_max - q_min)
zero_point = np.round(q_min - x_min / scale).astype(np.int32)
zero_point = int(np.clip(zero_point, q_min, q_max))

# 2. Quantize
q_asym = np.round(x / scale + zero_point)
q_asym = np.clip(q_asym, q_min, q_max).astype(np.int8)

# 3. Dequantize
x_dequant = (q_asym.astype(np.float32) - zero_point) * scale

print("x_min, x_max =", x_min, x_max)
print("scale =", scale, "zero_point =", zero_point)
print("quantized q_asym =", q_asym)
print("dequantized x̂ =", x_dequant)
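
As a rough illustration of when the zero-point pays off, the sketch below compares both schemes on synthetic non-negative data, similar to what comes out of a ReLU. The helper names `roundtrip_symmetric` and `roundtrip_asymmetric` are assumptions made for this example.

```python
import numpy as np

def roundtrip_symmetric(x):
    """Quantize and dequantize with a symmetric int8 scheme."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

def roundtrip_asymmetric(x, q_min=-128, q_max=127):
    """Quantize and dequantize with an asymmetric int8 scheme."""
    scale = (x.max() - x.min()) / (q_max - q_min)
    zero_point = int(np.clip(np.round(q_min - x.min() / scale), q_min, q_max))
    q = np.clip(np.round(x / scale + zero_point), q_min, q_max)
    return (q - zero_point) * scale

rng = np.random.default_rng(0)
# Synthetic post-ReLU activations: every value is non-negative.
act = np.maximum(rng.normal(0.0, 1.0, size=10_000), 0.0).astype(np.float32)

err_sym = float(np.mean(np.abs(act - roundtrip_symmetric(act))))
err_asym = float(np.mean(np.abs(act - roundtrip_asymmetric(act))))
print(f"symmetric  mean abs error: {err_sym:.5f}")
print(f"asymmetric mean abs error: {err_asym:.5f}")
```

The symmetric scheme spends half of its integer range on negative values that never occur here, so the asymmetric scheme ends up with a finer step and a smaller error. For roughly zero-centered data such as weights, the two are much closer and the symmetric form is cheaper to compute.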

graph TB
    subgraph "Symmetric quantization"
    A[Original float data] --> B[Compute scale]
    B -->|"$$scale = \frac{max(abs(x))}{127}$$"| C[Quantize]
    C -->|"$$x_{quant} = clip(round(x / scale), -127, 127)$$"| D[Quantized integer data]
    D --> E[Dequantize]
    E -->|"$$x_{dequant} = x_{quant} * scale$$"| F[Dequantized float data]
    end

    style A fill:#02f,stroke:#333,stroke-width:2px
    style F fill:#03b,stroke:#333,stroke-width:2px
    style D fill:#04d,stroke:#333,stroke-width:2px

graph TB
    subgraph "Asymmetric quantization"
    G[Original float data] --> H[Compute scale and zero_point]
    H -->|"$$scale = \frac{x_{max} - x_{min}}{q_{max} - q_{min}}$$"| I[Quantize]
    H -->|"$$zero_{point} = round(q_{min} - \frac{x_{min}}{scale})$$"| I
    I -->|"$$x_{quant} = clip(round(x / scale + zero_{point}), q_{min}, q_{max})$$"| J[Quantized integer data]
    J --> K[Dequantize]
    K -->|"$$x_{dequant} = (x_{quant} - zero_{point}) * scale$$"| L[Dequantized float data]
    end
    style G fill:#02f,stroke:#333,stroke-width:2px
    style L fill:#03b,stroke:#333,stroke-width:2px
    style J fill:#04d,stroke:#333,stroke-width:2px

Model Quantization Practice

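flowchart TD
    A([Start]) --> B["Choose a quantization scheme<br/>• INT8/FP8<br/>• symmetric/asymmetric<br/>• per-tensor/per-channel"]
    B --> C["Prepare calibration data<br/>(representative samples)"]
    C --> D["Compute quantization parameters<br/>scale and zero_point"]
    D --> E["Quantize weights<br/>W → Q_W"]
    E --> F["Quantize activations<br/>insert observers (PTQ) or FakeQuant (QAT)"]
    F --> G{"Does accuracy meet the requirement?"}
    G -- Yes --> H["Export and deploy the model<br/>ONNX/TensorRT/TFLite/..."]
    G -- No --> I["Quantization-aware training (QAT)<br/>fine-tune for 1–5 epochs"]
    I --> F
    H --> J["Evaluate and tune<br/>• accuracy<br/>• throughput/latency"]
    J --> K([End])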
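
The per-tensor versus per-channel choice in the first step of the flowchart can be sketched as follows. This is a minimal illustration with a hypothetical weight matrix whose output channels differ widely in magnitude; the function names are assumptions made for this example.

```python
import numpy as np

def quantize_per_tensor(w):
    """One scale for the whole tensor."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def quantize_per_channel(w, axis=0):
    """One scale per slice along `axis`; reduce over every other axis."""
    reduce_axes = tuple(i for i in range(w.ndim) if i != axis)
    scale = np.max(np.abs(w), axis=reduce_axes, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
# Hypothetical weights of shape (out_channels, in_features); rows span four orders of magnitude.
row_gain = np.array([[0.01], [0.1], [1.0], [10.0]], dtype=np.float32)
w = rng.normal(size=(4, 64)).astype(np.float32) * row_gain

q_t, s_t = quantize_per_tensor(w)
q_c, s_c = quantize_per_channel(w, axis=0)
err_tensor = float(np.mean(np.abs(w - q_t.astype(np.float32) * s_t)))
err_channel = float(np.mean(np.abs(w - q_c.astype(np.float32) * s_c)))
print(f"per-tensor  mean abs error: {err_tensor:.5f}")
print(f"per-channel mean abs error: {err_channel:.5f}")
```

With a single scale, the largest channel dictates the step size and the small channels collapse toward zero. Per-channel scales avoid this at the cost of storing one scale per channel.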

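Some additional notes on PTQ and QAT:
Post-Training Quantization (PTQ)

Definition: after training is complete, a small calibration set (typically a few hundred to a thousand samples) is used to compute the quantization parameters (scale, zero-point) for each layer's weights and activations, and the floating-point model is then converted to a low-bit-width integer model.
Pros:
- Fast: no retraining or fine-tuning is needed
- Simple to implement, which suits resource-constrained settings
Cons:
- Larger quantization error; the accuracy loss is most visible in sensitive layers and small models
- Less robust to skewed or long-tailed distributions, where a few outliers inflate the quantization range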

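The Role of the Calibration Set

In static quantization, the calibration set serves the following purposes:

Estimating activation distributions:
- Run a forward pass over the calibration set so that the observers collect numerical statistics at each node.
- These statistics are later used to compute the quantization parameters.
Determining quantization parameters:
- Compute scale and zero-point from the collected statistics.
- Min-max or histogram-based methods can be used to balance precision against dynamic range.
Representative samples:
- The calibration set should cover the typical inputs of the target application so that the quantized model performs well in practice.
- Usually 100 to 1000 samples are enough, trading statistical stability against calibration time.
Static vs. dynamic quantization:
- Static quantization needs a calibration set, but inference is faster.
- Dynamic quantization needs no calibration, but the activation quantization parameters must be computed on every inference.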
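
The calibration loop described above can be sketched with a small running min-max tracker. The class `RunningMinMax` is a hypothetical stand-in written for this example, not a framework API, and the calibration batches are synthetic.

```python
import numpy as np

class RunningMinMax:
    """Hypothetical observer: tracks the global min/max seen across calibration batches."""

    def __init__(self):
        self.min_val, self.max_val = np.inf, -np.inf

    def observe(self, x):
        # Update the running range with the statistics of one batch.
        self.min_val = min(self.min_val, float(x.min()))
        self.max_val = max(self.max_val, float(x.max()))

    def qparams(self, q_min=-128, q_max=127):
        # Turn the observed range into asymmetric quantization parameters.
        scale = (self.max_val - self.min_val) / (q_max - q_min)
        zero_point = int(np.clip(round(q_min - self.min_val / scale), q_min, q_max))
        return scale, zero_point

rng = np.random.default_rng(0)
observer = RunningMinMax()
for _ in range(10):  # 10 synthetic calibration batches
    observer.observe(rng.normal(0.0, 1.0, size=(32, 16)).astype(np.float32))

scale, zero_point = observer.qparams()
print("scale =", scale, "zero_point =", zero_point)
```

Once calibration is done, `scale` and `zero_point` are frozen and reused for every inference. A histogram-based observer would follow the same pattern but choose a clipped range instead of the raw min and max, which makes it less sensitive to outliers.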

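Below is an example of dynamic quantization. The activation scale and zero-point differ from one inference to the next, so they must be computed at inference time from the current batch.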

import numpy as np

# --- 1. Prepare float input activations x_fp32 and float weights w_fp32 (the weights are quantized offline) ---
batch, in_dim, out_dim = 2, 16, 8
x_fp32 = np.random.randn(batch, in_dim).astype(np.float32) * 5.0
w_fp32 = np.random.randn(in_dim, out_dim).astype(np.float32) * 0.5

# --- 2. DynamicQuantizeLinear (quantize activations at runtime) ---
# 2.1 Collect the min/max of the current batch
x_min, x_max = x_fp32.min(), x_fp32.max()
qmin, qmax = -128, 127

# 2.2 Compute scale and zero_point
scale_x = (x_max - x_min) / (qmax - qmin)
zero_x  = np.round(qmin - x_min / scale_x).astype(np.int32)
zero_x  = np.clip(zero_x, qmin, qmax)

# 2.3 Quantize the activations
x_q = np.clip(np.round(x_fp32 / scale_x) + zero_x, qmin, qmax).astype(np.int8)

# --- 3. QuantizeLinear (quantize the weights offline) ---
w_min, w_max = w_fp32.min(), w_fp32.max()
scale_w = (w_max - w_min) / (qmax - qmin)
zero_w  = np.round(qmin - w_min / scale_w).astype(np.int32)
zero_w  = np.clip(zero_w, qmin, qmax)
w_q = np.clip(np.round(w_fp32 / scale_w) + zero_w, qmin, qmax).astype(np.int8)

# --- 4. Integer matrix multiply with int32 accumulation ---
#   (in ONNX this corresponds to MatMulInteger; QLinearMatMul would also requantize the output)
#   y_int32 = (x_q - zero_x) @ (w_q - zero_w)
x_int32 = x_q.astype(np.int32) - zero_x
w_int32 = w_q.astype(np.int32) - zero_w
y_int32 = x_int32.dot(w_int32)

# --- 5. DequantizeLinear (dequantize the output) ---
#   scale_out = scale_x * scale_w
scale_y = scale_x * scale_w
y_fp32  = y_int32.astype(np.float32) * scale_y

# --- 6. Compare the outputs ---
print("x_fp32[0,:5]    ", x_fp32[0, :5])
print("x_q   [0,:5]    ", x_q[0, :5])
print("w_q   [:5,0]    ", w_q[:5, 0])
print("y_int32[0,:5]  ", y_int32[0, :5])
print("y_fp32[0,:5]   ", y_fp32[0, :5])

flowchart LR
    subgraph Activation
        x_fp32["x_fp32 (Float)"]
        x_fp32 --> DQL["DynamicQuantizeLinear"]
        DQL --> x_q["x_q (Int8)"]
    end

    subgraph Weight
        w_fp32["w_fp32 (Float)"]
        w_fp32 --> QLw["QuantizeLinear"]
        QLw --> w_q["w_q (Int8)"]
    end

    x_q --> QLM["MatMulInteger"]
    w_q --> QLM

    QLM --> y_int32["y_int32 (Int32)"]
    y_int32 --> DQLY["DequantizeLinear"]
    DQLY --> y_fp32["y_fp32 (Float)"]

import torch
import torch.nn as nn
import torch.quantization as tq

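# --- 1. Define a simple model ---
class SimpleConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub/DeQuantStub mark where tensors enter and leave the quantized region;
        # eager-mode static quantization needs them so quantized ops receive quantized inputs
        self.quant = tq.QuantStub()
        # A conv block followed by ReLU, then a fully connected layer
        self.conv = nn.Conv2d(1, 8, kernel_size=3, stride=1, padding=1)
        self.relu = nn.ReLU()
        self.fc   = nn.Linear(8*28*28, 10)
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))
        # flatten also copes with non-contiguous tensors, which view() rejects
        x = torch.flatten(x, 1)
        x = self.fc(x)
        return self.dequant(x)

# --- 2. Instantiate the model and fuse conv+relu for quantization ---
#  Fusing merges the convolution and the activation into one module so they are
#  quantized together, which avoids stacking an extra quantization error between them
model_fp32 = SimpleConvNet().eval()
model_fp32_fused = tq.fuse_modules(
    model_fp32,
    [['conv', 'relu']],
    inplace=False
)

# --- 3. Set the quantization config (qconfig) ---
#    Use the default static quantization config for the fbgemm (x86) backend;
#    by default it quantizes activations per-tensor and weights per-channel
model_fp32_fused.qconfig = tq.get_default_qconfig('fbgemm')

# --- 4. Prepare: insert observer modules ---
#    This adds an observer to every quantizable module to record activation ranges
#    (FakeQuantize modules are only used in QAT)
tq.prepare(model_fp32_fused, inplace=True)

# --- 5. Calibration: run representative data through the model ---
#    Forward passes only; no backward pass or optimizer is needed
#    Random data stands in here; real use should feed 100–1000 calibration samples
calib_data = torch.randn(100, 1, 28, 28)  # e.g. 100 grayscale 28×28 images
with torch.no_grad():
    for batch in torch.split(calib_data, 10):
        _ = model_fp32_fused(batch)

# --- 6. Convert to a quantized model ---
#    This computes scale/zero_point from the statistics accumulated by the observers
#    and swaps the observed modules for real quantized operators
model_int8 = tq.convert(model_fp32_fused.eval(), inplace=False)

# --- 7. Compare inference results ---
#    Feed one random input and compare the FP32 and INT8 outputs
test_input = torch.randn(1, 1, 28, 28)
out_fp32 = model_fp32(test_input)
out_int8 = model_int8(test_input)

print("FP32 output:", out_fp32)
print("INT8 output:", out_int8)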

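Quantization-Aware Training (QAT)

Definition: quantization simulation (FakeQuant) operators are inserted during training or fine-tuning, so that the forward pass mimics low-precision arithmetic while the backward pass keeps updating the parameters, letting the model adapt to the quantization error.
Pros:
- Usually recovers most of the accuracy lost by PTQ, and sometimes surpasses it
- Handles complex distributions and custom quantization strategies better
Cons:
- Requires extra training or fine-tuning (1–5 epochs)
- More complex to implement: FakeQuant layers must be inserted in the training framework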
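
To show what a FakeQuant operator does, here is a minimal NumPy sketch of its forward pass together with the straight-through estimator (STE) commonly used for its backward pass. The function names and the fixed `scale` and `zero_point` are assumptions made for this example; training frameworks implement this inside their autograd machinery.

```python
import numpy as np

def fake_quant_forward(x, scale, zero_point, q_min=-128, q_max=127):
    """Quantize then immediately dequantize, so the output stays in floating point."""
    q = np.clip(np.round(x / scale) + zero_point, q_min, q_max)
    return (q - zero_point) * scale

def fake_quant_backward(grad_out, x, scale, zero_point, q_min=-128, q_max=127):
    """STE: pass the gradient through where x was not clipped, zero it elsewhere."""
    q_unclipped = np.round(x / scale) + zero_point
    inside = (q_unclipped >= q_min) & (q_unclipped <= q_max)
    return grad_out * inside

x = np.array([-2.0, -0.3, 0.0, 0.26, 5.0], dtype=np.float32)
scale, zero_point = 0.01, 0  # representable range is [-1.28, 1.27]

y = fake_quant_forward(x, scale, zero_point)
grad_x = fake_quant_backward(np.ones_like(x), x, scale, zero_point)
print("fake-quantized y:", y)
print("gradient w.r.t. x:", grad_x)
```

Rounding has zero gradient almost everywhere, so the STE treats it as the identity inside the representable range. Values that were clipped (here -2.0 and 5.0) receive no gradient.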

import torch
import torch.nn as nn
import torch.quantization as tq

# --- Define a simple model ---
class QATModel(nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub/DeQuantStub mark where tensors enter and leave the quantized region
        self.quant = tq.QuantStub()
        self.conv = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=1)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(16, 2)
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))
        x = torch.flatten(x, 1)
        x = self.fc(x)
        return self.dequant(x)

# --- Instantiate and fuse conv+relu ---
model = QATModel()
model.train()
model_fused = tq.fuse_modules(model, [['conv', 'relu']])

# --- Configure QAT: insert FakeQuantize modules ---
model_fused.qconfig = tq.get_default_qat_qconfig('fbgemm')
tq.prepare_qat(model_fused, inplace=True)

# --- Simulate training on a few batches ---
opt = torch.optim.SGD(model_fused.parameters(), lr=1e-2)
for _ in range(5):
    data = torch.randn(4, 1, 4, 4)
    target = torch.randint(0, 2, (4,))
    opt.zero_grad()
    out = model_fused(data)
    loss = nn.functional.cross_entropy(out, target)
    loss.backward()
    opt.step()

# --- Convert to a quantized model ---
model_qat = tq.convert(model_fused.eval(), inplace=False)

# --- Inference ---
data = torch.randn(1, 1, 4, 4)
print("QAT output:", model_qat(data))

Current Research

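LLM-Specific PTQ Methods

GPTQ is a one-shot weight quantization method based on approximate second-order information; it can quantize a 175B-parameter model to 3–4 bits on a single GPU with almost no accuracy loss.
AWQ uses activation-aware per-channel scaling: protecting only about 1% of the salient weights greatly reduces quantization error, and it achieves efficient 4-bit compression and speedup, including on multimodal LLMs.
SmoothQuant migrates activation outliers into the weights offline, enabling W8A8 PTQ with up to 1.56× speedup and 2× memory reduction on LLMs.
The more recent SmoothQuant+ pushes PTQ to 4-bit group-wise weight quantization for nearly lossless LLM deployment.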
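
The outlier migration behind SmoothQuant can be sketched in a few lines. This is a simplified illustration on synthetic data using the per-input-channel smoothing factor s_j = max|X_j|^α / max|W_j|^(1-α) with α = 0.5; the function name `smooth` and the data are assumptions made for this example, and the real method also fuses the scaling into the preceding layer.

```python
import numpy as np

def smooth(x, w, alpha=0.5):
    """Migrate activation outliers into the weights with a per-input-channel scale s."""
    act_max = np.max(np.abs(x), axis=0)   # per input channel, shape (in_dim,)
    w_max = np.max(np.abs(w), axis=1)     # per input channel, shape (in_dim,)
    s = act_max ** alpha / w_max ** (1.0 - alpha)
    # X @ W == (X / s) @ (s * W), so the layer output is unchanged.
    return x / s, w * s[:, None], s

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
x[:, 3] *= 50.0                           # synthetic outlier channel in the activations
w = rng.normal(size=(16, 4)) * 0.1        # weights, shape (in_dim, out_dim)

x_s, w_s, s = smooth(x, w)
print("max |x| before:", np.max(np.abs(x)), "after:", np.max(np.abs(x_s)))
print("max |w| before:", np.max(np.abs(w)), "after:", np.max(np.abs(w_s)))
```

The product stays mathematically identical while the activation range shrinks and the weight range grows a little, which makes both tensors easier to quantize to 8 bits.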

Model Export and Deploy

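Method	Export format / tool	Supported hardware	Advantages	Disadvantages
TorchScript	`.pt` (TorchScript)	CPU/GPU	Native PyTorch support with no dependencies beyond PyTorch; easy to load from C++ or Python	Depends on libtorch; GPU performance trails dedicated inference frameworks
ONNX + ONNX Runtime	`.onnx`	CPU/GPU (multi-platform)	Cross-framework, multi-language deployment; mature ecosystem	Limited custom-operator support; performance depends on the runtime's execution providers
ONNX + TensorRT	`.trt`	NVIDIA GPU	Top-tier GPU inference performance; automatic mixed precision and quantization optimizations	NVIDIA platforms only; export and deployment workflow is relatively complex
TensorFlow → TFLite	`.tflite`	Mobile/embedded devices	Lightweight, low latency, small footprint; well suited to mobile and IoT	Limited operator support; needs calibration data and accuracy may drop
OpenVINO	`.xml` + `.bin`	Intel CPU/VPU	Unified deployment across multiple devices; deeply optimized for Intel hardware	Intel platforms only; steeper learning curve
Apache TVM	`.so` / `.dll`	CPU/GPU/dedicated accelerators	Highly customizable compilation pipeline; supports many platforms and backends	Compilation and tuning are complex; requires in-depth knowledge of the framework and its tuning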