Tags: python, tensorflow, math, pytorch, quantization

Dequantize values back to their original values prior to quantization


The paper "Natural Language Processing with Small Feed-Forward Networks" https://arxiv.org/pdf/1708.00214.pdf states:

[Image: the paper's quantization equations — a per-row scale s_i = max_j |e_ij| / (b - 1) and quantized values q_ij = floor(e_ij / s_i + b + 0.5)]

I've implemented quantization as per the above equations in Python:

import math

b = 128

embedding_matrix = [[20000, 3000, 1000], [1999999, 20000, 1999999], [20000, 3000, 1000]]

# Per-row scale: s_i = max(row) / (b - 1), rounded to 3 decimal places
scaled = [abs(round((1 / (b - 1)) * max(e), 3)) for e in embedding_matrix]
print(scaled)

# Quantize each value: q = floor(v / s_i + b + 0.5), keeping the original value alongside
quantized = []
for i, e in enumerate(embedding_matrix):
    for v in e:
        quantized.append((v, math.floor(0.5 + (v / scaled[i]) + b)))

quantized

Running this code, quantized is set to:

[(20000, 255),
 (3000, 147),
 (1000, 134),
 (1999999, 255),
 (20000, 129),
 (1999999, 255),
 (20000, 255),
 (3000, 147),
 (1000, 134)]

How do I de-quantize back to the original values prior to quantization?

The TensorFlow docs at https://www.tensorflow.org/api_docs/python/tf/quantization/dequantize describe:

tf.quantization.dequantize(
    input, min_range, max_range, mode='MIN_COMBINED', name=None, axis=None,
    narrow_range=False, dtype=tf.dtypes.float32
)

[min_range, max_range] are scalar floats that specify the range for the output. The 'mode' attribute controls exactly which calculations are used to convert the float values to their quantized equivalents.

The PyTorch quantization docs (https://pytorch.org/docs/stable/quantization.html) are also relevant.

Both seem to implement quantization differently from my implementation above?


Solution

  • What they are doing in the paper is roughly this:

    import numpy as np
    
    b = 128
    
    embedding_matrix = np.array([[20000, 3000, 1000, 1000],
                                 [1999999, 20000, 1999999, 1999999],
                                 [20000, 3000, 1000, 1000]])
    # Per-row scale: s_i = max_j |e_ij| / (b - 1)
    scales = (np.abs(embedding_matrix).max(axis=1) / (b - 1)).reshape(-1, 1)
    # Quantize: q_ij = floor(e_ij / s_i + b + 0.5), stored as unsigned 8-bit integers
    quantized = (embedding_matrix / scales + b + 0.5).astype(np.uint8)
    # Dequantize: undo the affine mapping, dequantized_ij = (q_ij - b) * s_i
    dequantized = (quantized - b) * scales
    print(quantized)
    print(dequantized)
    

    Output:

    [[255 147 134 134]
     [255 129 255 255]
     [255 147 134 134]]
    [[2.00000000e+04 2.99212598e+03 9.44881890e+02 9.44881890e+02]
     [1.99999900e+06 1.57480236e+04 1.99999900e+06 1.99999900e+06]
     [2.00000000e+04 2.99212598e+03 9.44881890e+02 9.44881890e+02]]
    

    In short, the paper's quantization is q_ij = round(e_ij / s_i + b). Once you only have the quantized value q_ij, the best you can do is assume q_ij ≈ e_ij / s_i + b, so the dequantized value is dequantized_ij = (q_ij - b) * s_i.
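
    Applied directly to the pure-Python snippet from the question, a minimal dequantization sketch could look like this (the hard-coded scaled and quantized values below are simply the ones printed by that snippet):

    b = 128

    # Per-row scales printed by the question's snippet
    scaled = [157.48, 15748.024, 157.48]

    # Flat, row-major list of (original, quantized) pairs from the question
    quantized = [(20000, 255), (3000, 147), (1000, 134),
                 (1999999, 255), (20000, 129), (1999999, 255),
                 (20000, 255), (3000, 147), (1000, 134)]

    row_len = 3  # values per row in the original embedding matrix
    dequantized = [(q - b) * scaled[idx // row_len]
                   for idx, (_, q) in enumerate(quantized)]
    print(dequantized)

    This recovers roughly the same approximations as the NumPy version above, e.g. 20000 comes back as 19999.96 and 1000 as 944.88; the error is the price of quantization.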

    As for PyTorch, similar functionality is available with torch.quantize_per_channel; e.g. the following code does pretty much the same:

    import torch
    
    t = torch.tensor(embedding_matrix, dtype=torch.float32)
    # One zero point per row (channel), all equal to b
    zero_point = torch.tensor([b]).repeat(t.shape[0], 1).reshape(-1)
    # Per-channel (per-row) affine quantization with scale max_j |e_ij| / (b - 1)
    quantized_tensor = torch.quantize_per_channel(t, t.abs().max(axis=1)[0] / (b - 1), zero_point, 0, torch.quint8)
    print(quantized_tensor)
    print(quantized_tensor.int_repr())
    

    Output:

    tensor([[2.0000e+04, 2.9921e+03, 9.4488e+02, 9.4488e+02],
            [2.0000e+06, 1.5748e+04, 2.0000e+06, 2.0000e+06],
            [2.0000e+04, 2.9921e+03, 9.4488e+02, 9.4488e+02]], size=(3, 4),
           dtype=torch.quint8, quantization_scheme=torch.per_channel_affine,
           scale=tensor([  157.4803, 15748.0234,   157.4803], dtype=torch.float64),
           zero_point=tensor([128, 128, 128]), axis=0)
    tensor([[255, 147, 134, 134],
            [255, 129, 255, 255],
            [255, 147, 134, 134]], dtype=torch.uint8)
    

    If you quantize per channel like this in PyTorch, you can only apply .dequantize() to the full tensor rather than to a slice, which wouldn't be a good thing for embeddings, but you can do it manually very easily using int_repr(), q_per_channel_zero_points(), and q_per_channel_scales().
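
    For example, a minimal sketch of that manual, slice-level dequantization (reusing quantized_tensor and b from above) could be:

    # Full-tensor dequantization works directly:
    # print(quantized_tensor.dequantize())

    # Recover the per-row scales and zero points stored on the quantized tensor
    scales = quantized_tensor.q_per_channel_scales()
    zero_points = quantized_tensor.q_per_channel_zero_points()
    q = quantized_tensor.int_repr()  # raw uint8 values

    # Dequantize only row 1 without dequantizing the whole tensor
    row = 1
    dequantized_row = (q[row].to(torch.float64) - zero_points[row]) * scales[row]
    print(dequantized_row)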

    Does this answer your question?