I see on https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct/tree/main/onnx:
| File Name | Size |
|---|---|
| model.onnx | 654 MB |
| model_fp16.onnx | 327 MB |
| model_q4.onnx | 200 MB |
| model_q4f16.onnx | 134 MB |
I understand that:

- `model.onnx` is the fp32 model,
- `model_fp16.onnx` is the model whose weights are quantized to fp16.

I don't understand the sizes of `model_q4.onnx` and `model_q4f16.onnx`:

Why is `model_q4.onnx` 200 MB instead of 654 MB / 4 = 163.5 MB? I thought `model_q4.onnx` meant that the weights are quantized to 4 bits.
Why is `model_q4f16.onnx` 134 MB instead of 654 MB / 4 = 163.5 MB? I thought `model_q4f16.onnx` meant that the weights are quantized to 4 bits and the activations are fp16, since https://llm.mlc.ai/docs/compilation/configure_quantization.html states:

> `qAfB(_id)`, where `A` represents the number of bits for storing weights and `B` represents the number of bits for storing activations.

and "Why do activations need more bits (16bit) than weights (8bit) in TensorFlow's neural network quantization framework?" indicates that activations don't count toward the model size (understandably).
Note that `FLOAT32` is 32 bits and `INT4` is 4 bits, so you'd expect the quantized weights to reduce down to 654 MB / 8 = 81.75 MB, not 163.5 MB.
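As a quick back-of-the-envelope check against the listing above (assuming every parameter were stored purely in the target type and nothing else contributed to the file size):

```python
fp32_size = 654.0  # MB, size of model.onnx

for name, bits in [("model_fp16.onnx", 16), ("model_q4.onnx", 4), ("model_q4f16.onnx", 4)]:
    print(f"{name}: expected ≈ {fp32_size * bits / 32:.2f} MB")

# model_fp16.onnx:   expected ≈ 327.00 MB  -> matches the actual 327 MB
# model_q4.onnx:     expected ≈ 81.75 MB   -> actual is 200 MB
# model_q4f16.onnx:  expected ≈ 81.75 MB   -> actual is 134 MB
```

The fp16 file matches that naive estimate almost exactly; only the 4-bit files don't, and that gap is what needs explaining.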
The reason you are seeing a non-linear reduction in file size is that quantized models aren't (usually) completely free of floating point arithmetic. Weights of neural networks are (usually) fairly small and distributed around zero - that is, impossible to represent with integers at any meaningful precision. To mitigate this, most quantization schemes first scale the floating point values into a range representable with integers, cast them, perform the operation in the quantized data type, and finally scale and cast the outputs back to floating point.
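To make that concrete, here is a minimal sketch of blockwise symmetric 4-bit quantization in Python. The group size of 32 and the packing are assumptions for illustration only; the actual layout used by ONNX Runtime's 4-bit quantization differs in its details.

```python
import numpy as np

def quantize_int4(weights: np.ndarray, group_size: int = 32):
    """Quantize a flat float32 weight tensor to int4 with one fp32 scale per group."""
    w = weights.reshape(-1, group_size)                   # one scale per group of weights
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0   # int4 range is [-8, 7]
    scales = np.maximum(scales, 1e-12)                    # avoid division by zero
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales                                      # both have to be stored in the file

def dequantize_int4(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Scale the int4 values back to float32 for the actual computation."""
    return q.astype(np.float32) * scales

w = np.random.randn(4096 * 32).astype(np.float32)
q, scales = quantize_int4(w)
packed_bytes = q.size // 2   # two int4 values fit in one byte
print(packed_bytes, "bytes of int4 weights +", scales.nbytes, "bytes of fp32 scales")
```

Storing the scales in fp16 instead of fp32 would halve that second term, which is the difference between `model_q4` and `model_q4f16` described below.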
So, the quantized `onnx` graphs contain not only the INT4 weights, but also the scaling factors that go with them. In `model_q4` the scaling factors are stored in `FLOAT32` (and the intermediate activations live in single precision), while `model_q4f16` stores them in `FLOAT16`, hence the smaller file size. We can verify this by checking that the difference between the actual and the expected file size is approximately double for `model_q4` compared to `model_q4f16`.
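A rough check with the listed sizes (this lumps everything that isn't a 4-bit weight, such as embeddings and other non-quantized tensors, into the "overhead" term, so it is only approximate):

```python
expected_int4 = 654 / 8                # 81.75 MB if every weight were pure int4
overhead_q4 = 200 - expected_int4      # ≈ 118.25 MB: fp32 scales + non-quantized tensors
overhead_q4f16 = 134 - expected_int4   # ≈ 52.25 MB: fp16 scales + non-quantized tensors
print(overhead_q4 / overhead_q4f16)    # ≈ 2.26, i.e. roughly double, as expected
```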