neural-network, artificial-intelligence, onnx, quantization, static-quantization

Does static quantization enable the model to feed a layer with the output of the previous one, without converting to fp (and back to int)?


I was reading about quantization (specifically about int8) and trying to figure out whether there is a way to avoid dequantizing and requantizing the output of a node before feeding it to the next one. I eventually found the definitions of static and dynamic quantization. According to onnxruntime:

Dynamic quantization calculates the quantization parameters (scale and zero point) for activations dynamically. [...] Static quantization method first runs the model using a set of inputs called calibration data. During these runs, we compute the quantization parameters for each activations. These quantization parameters are written as constants to the quantized model and used for all inputs.

To me that seems quite clear: the difference between the two methods is about when the (de)quantization parameters are computed (dynamic does it at inference time, static does it before inference and hardcodes them into the model), not about the actual (de)quantization process itself.
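Just to fix ideas about what those parameters are, this is my own rough sketch of asymmetric uint8 (de)quantization (the function names are mine, not onnxruntime internals); the only thing that changes between the two methods is when the parameters get computed:

import numpy as np

def compute_qparams(x, qmin=0, qmax=255):
    # Map the observed range [min(x), max(x)] onto the integer range [qmin, qmax].
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / (qmax - qmin) or 1.0  # guard against constant tensors
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point):
    return np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)

def dequantize(x_q, scale, zero_point):
    return (x_q.astype(np.float32) - zero_point) * scale

# Dynamic quantization computes (scale, zero_point) for each activation at
# inference time from the actual tensor; static quantization computes them once
# from calibration data and stores them as constants in the model.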

However, I came across some articles/forum answers that seem to point in a different direction. This article says about static quantization:

[...] Importantly, this additional step allows us to pass quantized values between operations instead of converting these values to floats - and then back to ints - between every operation, resulting in a significant speed-up.

It seems to be arguing that static quantization removes the need to apply dequantize and then quantize operations to the output of a node before feeding it as input to the next one. I also found a discussion arguing the same:

Q: [...] However, our hardware colleagues told me that because it has FP scales and zero-points in channels, the hardware should still support FP in order to implement it. They also argued that in each internal stage, the values (in-channels) should be dequantized and converted to FP and quantized again for the next layer. [...]

A: For the first argument you are right, since scales and zero-points are FP, hardware need to support FP for the computation. The second argument may not be true, for static quantization the output of the previous layer can be fed into next layer without dequantizing to FP. Maybe they are thinking about dynamic quantization, which keeps tensors between two layers in FP.

And others have answered the same.

So I tried to manually quantize a model using onnxruntime.quantization.quantize_static. Before going on I have to make a premise: I'm not in the AI field, and I'm learning about this topic for another purpose. So I googled how to do it and managed to get it done with the following code:

import torch
import torchvision as tv
import onnxruntime
from onnxruntime import quantization


MODEL_PATH = "best480x640.onnx"
MODEL_OPTIMIZED_PATH = "best480x640_optimized.onnx"
QUANTIZED_MODEL_PATH = "best480x640_quantized.onnx"


# Calibration data reader: feeds preprocessed batches to the static quantizer
class QuantizationDataReader(quantization.CalibrationDataReader):
    def __init__(self, torch_ds, batch_size, input_name):

        self.torch_dl = torch.utils.data.DataLoader(
            torch_ds, batch_size=batch_size, shuffle=False)

        self.input_name = input_name
        self.datasize = len(self.torch_dl)

        self.enum_data = iter(self.torch_dl)

    def to_numpy(self, pt_tensor):
        return (pt_tensor.detach().cpu().numpy() if pt_tensor.requires_grad
                else pt_tensor.cpu().numpy())

    def get_next(self):
        batch = next(self.enum_data, None)
        if batch is not None:
            return {self.input_name: self.to_numpy(batch[0])}
        else:
            return None

    def rewind(self):
        self.enum_data = iter(self.torch_dl)


preprocess = tv.transforms.Compose([
    tv.transforms.Resize((480, 640)),
    tv.transforms.ToTensor(),
    tv.transforms.Normalize(
        mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

ds = tv.datasets.ImageFolder(root="./calib/", transform=preprocess)

# pre-process the model (symbolic shape inference + graph optimisations) before quantizing
quantization.shape_inference.quant_pre_process(
    MODEL_PATH, MODEL_OPTIMIZED_PATH, skip_symbolic_shape=False)

quant_ops = {"ActivationSymmetric": False, "WeightSymmetric": True}
ort_sess = onnxruntime.InferenceSession(
    MODEL_PATH, providers=["CPUExecutionProvider"])
qdr = QuantizationDataReader(
    ds, batch_size=1, input_name=ort_sess.get_inputs()[0].name)
quantized_model = quantization.quantize_static(
    model_input=MODEL_OPTIMIZED_PATH,
    model_output=QUANTIZED_MODEL_PATH,
    calibration_data_reader=qdr,
    extra_options=quant_ops
)
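
For completeness, here is a small sketch (mine, assuming the onnx package is installed) that dumps the operator types of the two graphs instead of relying only on Netron:

import onnx
from collections import Counter

for path in (MODEL_OPTIMIZED_PATH, QUANTIZED_MODEL_PATH):
    graph = onnx.load(path).graph
    # count how many nodes of each op type appear in the graph
    print(path, dict(Counter(node.op_type for node in graph.node)))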

However, the results confused me even more. The following images show a chunk of the two model graphs (the "original" one and the quantized one) in Netron. This is the non-quantized model graph.

[Netron screenshot of the original model graph]

While this is the quantized one:

[Netron screenshot of the quantized model graph]

The fact that it added QuantizeLinear/DequantizeLinear nodes may hint at the answer I'm looking for. However, the way those nodes are placed makes no sense to me: dequantization is computed immediately after quantization, so the inputs of the various Conv, Mul, etc. nodes are still float32 tensors. I'm sure I'm missing (or misunderstanding) something here, so I still can't figure out what I was originally looking for: does static quantization allow feeding a node with the still-quantized output of the previous one? And what am I getting wrong in the quantization process above?


Solution

  • Hardware AI guy here. I highly recommend reading my blog, https://franciscormendes.github.io/2024/05/16/quantization-layer-details/, but I will summarize it here. In short: if you want, you can pass values as int between layers as well. Consider the matrix multiplication (which is nothing but the output of a single layer in a neural network, with weights $W$ and bias $b$),

    $$Y = Wx + b$$

    This can be represented as a quantized multiplication (you can find the details in the blog),

    Option 1:

    $$Y = S_x(X_q-Z_x)S_w(W_q-Z_w) + S_b(b_q-Z_b)$$

    However, you can quantize the output too,

    Option 2: $$Y_q = \frac{S_xS_w}{S_Y}((X_q-Z_x)(W_q-Z_w)+b) + Z_Y$$

    Remember that $\frac{S_xS_w}{S_Y}$ is a constant known at compile time, so the multiplication by it can be treated as a fixed-point operation; we can write it as

    $$M := \frac{S_xS_w}{S_Y} = 2^{-n}M_0$$

    where $n$ is a fixed number determined at compile time (this is not true for floating point). Thus the entire expression

    $$Y_q = M\big((X_q-Z_x)(W_q-Z_w)+b\big) + Z_Y$$

    can be carried out with integer arithmetic, and all values exchanged between layers are integer values. So if your hardware supports only INT8 you will use

    Option 1(a): $$Y_q = 2^{-n}M_0\big((X_q-Z_x)(W_q-Z_w)+b\big) + Z_Y,$$ where the multiplication by $2^{-n}$ is just a right shift by $n$.

    Using the matrix multiplication example, FULL INT8 quantization essentially means you can deploy a neural network on a board that does not support ANY floating point operations. It is in fact $Y_q$ that is passed between layers when you do INT8 quantization.
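
    As a rough sketch (mine, with made-up scales) of how $M_0$ and $n$ can be obtained in the usual gemmlowp/TFLite style:

    import math

    S_x, S_w, S_Y = 0.02, 0.005, 0.03       # example scales (made up)
    M = S_x * S_w / S_Y                     # real-valued requantization multiplier

    mantissa, exp = math.frexp(M)           # M = mantissa * 2**exp, mantissa in [0.5, 1)
    M_0 = int(round(mantissa * (1 << 31)))  # int32 fixed-point multiplier
    n = 31 - exp                            # right-shift amount

    print(M, M_0 * 2.0 ** (-n))             # the two values match up to rounding, so
                                            # "* M" becomes an integer multiply by M_0
                                            # followed by a right shift by n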

    However, if you only need the weights and the multiplies to be quantized but not the activations, it means you get the benefits of quantization for saving weight storage and for using integer multiplies, BUT you choose to pass values between the layers as floats. For this case, PyTorch and Keras can also emit floating point values to be passed between layers, and they do this by simply omitting the output quantization step, so you do not need to quantize the output (Option 1):

    $$Y = S_xS_w(X_q-Z_x)(W_q-Z_w) + S_b(b_q-Z_b)$$
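
    To make the two options concrete, here is a small numerical sketch (mine, with symmetric int8 weights, asymmetric uint8 activations and made-up parameters; a real toolchain handles rounding and saturation more carefully):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=4).astype(np.float32)
    W = rng.uniform(-1, 1, size=(3, 4)).astype(np.float32)
    b = rng.uniform(-1, 1, size=3).astype(np.float32)
    Y = W @ x + b                                    # float reference

    # Quantization parameters (in static quantization these come from calibration)
    S_x, Z_x = 2 / 255, 128                          # asymmetric uint8 activations
    S_w, Z_w = float(np.abs(W).max()) / 127, 0       # symmetric int8 weights
    S_Y, Z_Y = 2 * float(np.abs(Y).max()) / 255, 128 # output scale/zero-point

    X_q = np.clip(np.round(x / S_x) + Z_x, 0, 255).astype(np.int32)
    W_q = np.round(W / S_w).astype(np.int32)
    b_q = np.round(b / (S_x * S_w)).astype(np.int32) # bias at the accumulator scale

    acc = W_q @ (X_q - Z_x) + b_q                    # int32 accumulator (Z_w = 0)

    # Option 2: requantize, pass the uint8 Y_q to the next layer
    Y_q = np.clip(np.round(S_x * S_w / S_Y * acc) + Z_Y, 0, 255)

    # Option 1: dequantize, pass floats to the next layer
    Y_f = S_x * S_w * acc

    print(Y)                     # float reference
    print(Y_f)                   # ~ Y
    print((Y_q - Z_Y) * S_Y)     # ~ Y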