Tags: deep-learning, large-language-model, huggingface, onnx, quantization

Why are model_q4.onnx and model_q4f16.onnx not 4 times smaller than model.onnx?


I see on https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct/tree/main/onnx:

File Name            Size
model.onnx           654 MB
model_fp16.onnx      327 MB
model_q4.onnx        200 MB
model_q4f16.onnx     134 MB

I understand that:

  • model.onnx is the fp32 model,
  • model_fp16.onnx is the model whose weights are quantized to fp16

I don't understand the sizes of model_q4.onnx and model_q4f16.onnx:

  1. Why is model_q4.onnx 200 MB instead of 654 MB / 4 = 163.5 MB? I thought model_q4.onnx meant that the weights are quantized to 4 bits.

  2. Why is model_q4f16.onnx 134 MB instead of 654 MB / 4 = 163.5 MB? I thought model_q4f16.onnx meant that the weights are quantized to 4 bits and activations are fp16, since https://llm.mlc.ai/docs/compilation/configure_quantization.html states:

    qAfB(_id), where A represents the number of bits for storing weights and B represents the number of bits for storing activations.

     and the question "Why do activations need more bits (16bit) than weights (8bit) in tensor flow's neural network quantization framework?" indicates that activations don't count toward the model size (understandably).


Solution

  • Note that FLOAT32 is 32 bits and INT4 is 4 bits, so you'd expect the quantized weights to reduce down to 654 / 8 = 81.75 MB, not 163.5 MB.

    The reason the file size doesn't shrink in proportion to the bit width is that quantized models aren't (usually) completely free of floating-point arithmetic. Neural network weights are (usually) fairly small and distributed around zero, which makes them impossible to represent directly as integers with any meaningful precision. To mitigate this, most quantization schemes first scale the floating-point values into a range representable by integers, cast them, perform the operation in the quantized data type, and finally scale and cast the outputs back to floating point (see the first sketch after this answer).

    So the quantized ONNX graphs contain not only the INT4 weights, but also the floating-point scaling factors that go with them. In model_q4 the scaling factors are stored in FLOAT32 (i.e. the remaining floating-point work stays in single precision), while model_q4f16 stores them in FLOAT16, hence the smaller file size. We can sanity-check this: the overhead beyond the expected 81.75 MB is about 200 - 81.75 = 118 MB for model_q4 and about 134 - 81.75 = 52 MB for model_q4f16, i.e. roughly double, as you'd expect from 32-bit versus 16-bit scaling factors (see the second sketch after this answer).

    This is a very nice visual guide to how quantization works.
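
Below is a minimal numpy sketch of that scale-and-cast idea, assuming a symmetric block-wise scheme with a block size of 32 (the block size, rounding rule, and storage layout are illustrative assumptions, not the exact scheme onnxruntime uses). It makes the overhead concrete: the 4-bit payload is only half a byte per weight, but every block also carries one scale, stored in fp32 in one variant and fp16 in the other.

```python
import numpy as np

BLOCK = 32  # assumed block size; real 4-bit kernels have their own parameters

def quantize_blockwise_int4(w: np.ndarray):
    """Quantize a 1-D fp32 array to 4-bit integers plus one scale per block."""
    w = w.astype(np.float32)
    pad = (-len(w)) % BLOCK
    w = np.pad(w, (0, pad))
    blocks = w.reshape(-1, BLOCK)

    # One scale per block: map the largest magnitude onto the int4 range [-8, 7].
    scales = np.abs(blocks).max(axis=1) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks

    q = np.clip(np.round(blocks / scales[:, None]), -8, 7).astype(np.int8)
    return q, scales  # int4 values (held in int8 here; real files pack 2 per byte)

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scales[:, None]

w = np.random.randn(1024).astype(np.float32) * 0.02  # small, zero-centred weights
q, scales = quantize_blockwise_int4(w)
w_hat = dequantize(q, scales).reshape(-1)[: len(w)]
print("max reconstruction error:", np.abs(w_hat - w).max())

# Size accounting for these 1024 weights:
n = len(w)
print("4-bit payload :", n // 2, "bytes")                  # 0.5 byte per weight
print("fp32 scales   :", (n // BLOCK) * 4, "extra bytes")  # model_q4-style overhead
print("fp16 scales   :", (n // BLOCK) * 2, "extra bytes")  # model_q4f16-style overhead
```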
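
And a quick check of the "roughly double" overhead, plus one way to see where the bytes actually go in the downloaded graphs. The file sizes are taken from the table above; the local path model_q4.onnx is a placeholder for wherever you saved the file, and the exact initializer data types you see will depend on your onnx/onnxruntime versions.

```python
import onnx
from collections import Counter

# Expected size if only the bit width mattered: 32-bit -> 4-bit is an 8x reduction.
fp32_mb, q4_mb, q4f16_mb = 654, 200, 134
expected_mb = fp32_mb / 8
print("expected 4-bit size :", expected_mb, "MB")                  # 81.75
print("model_q4 overhead   :", q4_mb - expected_mb, "MB")          # ~118
print("model_q4f16 overhead:", q4f16_mb - expected_mb, "MB")       # ~52
print("ratio:", (q4_mb - expected_mb) / (q4f16_mb - expected_mb))  # ~2.3, roughly double

# Where the bytes go: list the data types of the initializers (weights,
# scales, zero points) stored in a downloaded graph.
model = onnx.load("model_q4.onnx")  # placeholder path
dtype_counts = Counter(
    onnx.TensorProto.DataType.Name(init.data_type)
    for init in model.graph.initializer
)
print(dtype_counts)  # e.g. packed low-bit weight tensors alongside FLOAT scale tensors
```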