In the paper "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference", published by Google, the quantization scheme is described as follows (equation (4) in section 2.2):

q3 = Z3 + M * sum_{j=1}^{N} (q1 - Z1) * (q2 - Z2)

where

M = S1 * S2 / S3

and S1, S2 and S3 are the scales of the two inputs and of the output, respectively.
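To make the scheme concrete, here is a minimal sketch of equation (4) for a single dot product (my own code with made-up names, not TFLite's; a real kernel would also replace the float multiply by M with a fixed-point multiplier and shift, as section 2.2 describes):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Quantized dot product per equation (4): q3 = Z3 + M * sum_j (q1 - Z1)(q2 - Z2).
uint8_t QuantizedDot(const std::vector<uint8_t>& q1,  // quantized input 1
                     const std::vector<uint8_t>& q2,  // quantized input 2
                     int32_t Z1, int32_t Z2, int32_t Z3,
                     double S1, double S2, double S3) {
  // Integer-only accumulation of sum_j (q1 - Z1) * (q2 - Z2).
  int32_t acc = 0;
  for (size_t j = 0; j < q1.size(); ++j) {
    acc += (static_cast<int32_t>(q1[j]) - Z1) * (static_cast<int32_t>(q2[j]) - Z2);
  }
  // M = S1 * S2 / S3; a plain double keeps the sketch short.
  const double M = S1 * S2 / S3;
  const int32_t q3 = Z3 + static_cast<int32_t>(std::lround(M * acc));
  return static_cast<uint8_t>(std::min(255, std::max(0, q3)));  // clamp to uint8
}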
Both S1 (and zero point Z1) and S2 (and zero point Z2) can be determined easily, whether "offline" or "online". But what about S3 (and zero point Z3)? These parameters depend on the "actual" output scale, i.e., the scale of the float values that would be produced without quantization. But the output scale is unknown before the output is computed.
According to the TensorFlow documentation:
At inference, weights are converted from 8-bits of precision to floating point and computed using floating-point kernels. This conversion is done once and cached to reduce latency.
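For reference, a minimal sketch of what the documentation describes (assumed names, not the actual TFLite code) would expand the 8-bit weights to floats once via equation (1), r = S * (q - Z), and cache the result for the float kernels:

#include <cstdint>
#include <vector>

// Expand 8-bit weights to floats once; the returned buffer would then be
// cached and reused by the floating-point kernels.
std::vector<float> DequantizeWeights(const std::vector<uint8_t>& q_weights,
                                     float scale, int32_t zero_point) {
  std::vector<float> w(q_weights.size());
  for (size_t i = 0; i < w.size(); ++i) {
    // r = S * (q - Z), equation (1) of the paper.
    w[i] = scale * (static_cast<int32_t>(q_weights[i]) - zero_point);
  }
  return w;
}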
But the code below says something different:
tensor_utils::BatchQuantizeFloats(
    input_ptr, batch_size, input_size, quant_data, scaling_factors_ptr,
    input_offset_ptr, params->asymmetric_quantize_inputs);
for (int b = 0; b < batch_size; ++b) {
  // Incorporate scaling of the filter.
  scaling_factors_ptr[b] *= filter->params.scale;
}

// Compute output += weight * quantized_input
int32_t* scratch = GetTensorData<int32_t>(accum_scratch);
tensor_utils::MatrixBatchVectorMultiplyAccumulate(
    filter_data, num_units, input_size, quant_data, scaling_factors_ptr,
    batch_size, GetTensorData<float>(output), /*per_channel_scale=*/nullptr,
    input_offset_ptr, scratch, row_sums_ptr, &data->compute_row_sums,
    CpuBackendContext::GetFromContext(context));
Here we can see:

scaling_factors_ptr[b] *= filter->params.scale;

I think this means:

scaling_factor = S1 * S2

that is, the accumulated products are rescaled by S1 * S2 only, and S3 (and hence M = S1 * S2 / S3) never appears.
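To illustrate with made-up numbers: after BatchQuantizeFloats, scaling_factors_ptr[b] holds the per-batch input scale S1, and the loop folds in the filter scale S2:

#include <cstdio>

int main() {
  float scaling_factors[2] = {0.05f, 0.04f};  // per-batch input scales S1 (made up)
  const float filter_scale = 0.02f;           // filter scale S2 (made up)
  for (int b = 0; b < 2; ++b) {
    scaling_factors[b] *= filter_scale;       // now S1 * S2; S3 appears nowhere
  }
  std::printf("%g %g\n", scaling_factors[0], scaling_factors[1]);  // 0.001 0.0008
  return 0;
}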
This inconsistency between the paper, the documentation and the code confuses me; I can't tell what I'm missing. Can anyone help?
Let me answer my own question. All of a sudden I saw what I had missed while riding my bicycle. The code in the question above is from the function tflite::ops::builtin::fully_connected::EvalHybrid(). Here the name explains everything! The values in the output of the matrix multiplication are denoted r3 in section 2.2 of the paper. In terms of equation (2) in that section, we have:

r3 = sum_{j=1}^{N} r1 * r2
If we want to get the float result of the matrix multiplication, we can use equation (4) in section 2.2 and then convert the result back to floats, OR we can use equation (3) with the left side replaced with r3, as in:

r3 = sum_{j=1}^{N} S1 * (q1 - Z1) * S2 * (q2 - Z2) = S1 * S2 * sum_{j=1}^{N} (q1 - Z1) * (q2 - Z2)
If we choose all the zero points to be 0, then the formula above becomes:

r3 = S1 * S2 * sum_{j=1}^{N} q1 * q2
And this is just what EvalHybrid() does (ignoring the bias for the moment): it accumulates the integer products and multiplies the sum by S1 * S2 to get a float output directly, so S3 is never needed. It turns out the paper gives an outline of the quantization algorithm, while the implementation uses different variants of it.
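Putting it all together, here is a minimal end-to-end sketch of the hybrid scheme (my own reconstruction with assumed names, not the actual TFLite kernel): quantize the float input on the fly with a symmetric scale, accumulate the products in integers, and rescale by S1 * S2 straight back to float:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hybrid fully-connected sketch: float in, float out, int8 weights.
std::vector<float> HybridMatVec(const std::vector<float>& input,       // size N
                                const std::vector<int8_t>& q_weights,  // num_units x N, Z2 = 0
                                float weight_scale,                    // S2
                                int num_units) {
  const int N = static_cast<int>(input.size());

  // "Online" symmetric quantization of the input: S1 = max|x| / 127, Z1 = 0.
  float max_abs = 0.0f;
  for (float x : input) max_abs = std::max(max_abs, std::fabs(x));
  const float S1 = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
  std::vector<int8_t> q_input(N);
  for (int j = 0; j < N; ++j) {
    q_input[j] = static_cast<int8_t>(std::lround(input[j] / S1));
  }

  // Combined scale; this mirrors scaling_factors_ptr[b] *= filter->params.scale.
  const float S1S2 = S1 * weight_scale;

  std::vector<float> output(num_units);
  for (int u = 0; u < num_units; ++u) {
    int32_t acc = 0;  // integer accumulation: sum_j q1 * q2
    for (int j = 0; j < N; ++j) {
      acc += static_cast<int32_t>(q_input[j]) *
             static_cast<int32_t>(q_weights[u * N + j]);
    }
    // r3 = S1 * S2 * sum_j q1 * q2 -- float output, S3 never needed.
    output[u] = S1S2 * static_cast<float>(acc);
  }
  return output;
}

The design point is that the output stays in float, so the troublesome S3 never has to be estimated; that is the "hybrid" in EvalHybrid.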