In the paper "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference", published by Google, the quantization scheme is described as follows (equation (4) in section 2.2):

q3 = Z3 + M * sum_{j=1}^{N} (q1 - Z1) * (q2 - Z2)

where

M = S1 * S2 / S3

and S1, S2 and S3 are the scales of the two inputs and of the output, respectively.
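To make the scheme concrete, here is a minimal sketch of equation (4) for a single dot product (my own code with made-up names, not TFLite's; a real kernel would also replace the float multiply by M with a fixed-point multiplier and shift, as section 2.2 describes):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Quantized dot product per equation (4): q3 = Z3 + M * sum_j (q1 - Z1)(q2 - Z2).
uint8_t QuantizedDot(const std::vector<uint8_t>& q1,  // quantized input 1
                     const std::vector<uint8_t>& q2,  // quantized input 2
                     int32_t Z1, int32_t Z2, int32_t Z3,
                     double S1, double S2, double S3) {
  // Integer-only accumulation of sum_j (q1 - Z1) * (q2 - Z2).
  int32_t acc = 0;
  for (size_t j = 0; j < q1.size(); ++j) {
    acc += (static_cast<int32_t>(q1[j]) - Z1) * (static_cast<int32_t>(q2[j]) - Z2);
  }
  // M = S1 * S2 / S3; a plain double keeps the sketch short.
  const double M = S1 * S2 / S3;
  const int32_t q3 = Z3 + static_cast<int32_t>(std::lround(M * acc));
  return static_cast<uint8_t>(std::min(255, std::max(0, q3)));  // clamp to uint8
}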
Both S1 (and zero point Z1) and S2 (and zero point Z2) can be determined easily, whether "offline" or "online". But what about S3 (and zero point Z3)? These parameters depend on the "actual" output scale, i.e., the scale of the float values that would be produced without quantization. But the output scale is unknown before the output is computed.
According to the TensorFlow documentation:
At inference, weights are converted from 8-bits of precision to floating point and computed using floating-point kernels. This conversion is done once and cached to reduce latency.
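For reference, a minimal sketch of what the documentation describes (assumed names, not the actual TFLite code) would expand the 8-bit weights to floats once via equation (1), r = S * (q - Z), and cache the result for the float kernels:

#include <cstdint>
#include <vector>

// Expand 8-bit weights to floats once; the returned buffer would then be
// cached and reused by the floating-point kernels.
std::vector<float> DequantizeWeights(const std::vector<uint8_t>& q_weights,
                                     float scale, int32_t zero_point) {
  std::vector<float> w(q_weights.size());
  for (size_t i = 0; i < w.size(); ++i) {
    // r = S * (q - Z), equation (1) of the paper.
    w[i] = scale * (static_cast<int32_t>(q_weights[i]) - zero_point);
  }
  return w;
}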
But the code below says something different:
tensor_utils::BatchQuantizeFloats(
    input_ptr, batch_size, input_size, quant_data, scaling_factors_ptr,
    input_offset_ptr, params->asymmetric_quantize_inputs);
for (int b = 0; b < batch_size; ++b) {
  // Incorporate scaling of the filter.
  scaling_factors_ptr[b] *= filter->params.scale;
}

// Compute output += weight * quantized_input
int32_t* scratch = GetTensorData<int32_t>(accum_scratch);
tensor_utils::MatrixBatchVectorMultiplyAccumulate(
    filter_data, num_units, input_size, quant_data, scaling_factors_ptr,
    batch_size, GetTensorData<float>(output), /*per_channel_scale=*/nullptr,
    input_offset_ptr, scratch, row_sums_ptr, &data->compute_row_sums,
    CpuBackendContext::GetFromContext(context));
Here we can see:

scaling_factors_ptr[b] *= filter->params.scale;

I think this means:

scaling_factor = S1 * S2

that is, the accumulated products are rescaled by S1 * S2 only, and S3 (and hence M = S1 * S2 / S3) never appears.
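To illustrate with made-up numbers: after BatchQuantizeFloats, scaling_factors_ptr[b] holds the per-batch input scale S1, and the loop folds in the filter scale S2:

#include <cstdio>

int main() {
  float scaling_factors[2] = {0.05f, 0.04f};  // per-batch input scales S1 (made up)
  const float filter_scale = 0.02f;           // filter scale S2 (made up)
  for (int b = 0; b < 2; ++b) {
    scaling_factors[b] *= filter_scale;       // now S1 * S2; S3 appears nowhere
  }
  std::printf("%g %g\n", scaling_factors[0], scaling_factors[1]);  // 0.001 0.0008
  return 0;
}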
This inconsistency between the paper, the documentation and the code confuses me; I can't tell what I'm missing. Can anyone help?
Let me answer my own question. All of a sudden I saw what I had missed while riding my bicycle. The code in the question above is from the function tflite::ops::builtin::fully_connected::EvalHybrid(). Here the name explains everything! The values in the output of the matrix multiplication are denoted r3 in section 2.2 of the paper. In terms of equation (2) in that section, we have:

r3 = sum_{j=1}^{N} r1 * r2
If we want to get the float result of the matrix multiplication, we can use equation (4) in section 2.2 and then convert the result back to floats, OR we can use equation (3) with the left side replaced with r3, as in:

r3 = sum_{j=1}^{N} S1 * (q1 - Z1) * S2 * (q2 - Z2) = S1 * S2 * sum_{j=1}^{N} (q1 - Z1) * (q2 - Z2)
If we choose all the zero points to be 0, then the formula above becomes:

r3 = S1 * S2 * sum_{j=1}^{N} q1 * q2
And this is just what EvalHybrid() does (ignoring the bias for the moment): it accumulates the integer products and multiplies the sum by S1 * S2 to get a float output directly, so S3 is never needed. It turns out the paper gives an outline of the quantization algorithm, while the implementation uses different variants of it.
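Putting it all together, here is a minimal end-to-end sketch of the hybrid scheme (my own reconstruction with assumed names, not the actual TFLite kernel): quantize the float input on the fly with a symmetric scale, accumulate the products in integers, and rescale by S1 * S2 straight back to float:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hybrid fully-connected sketch: float in, float out, int8 weights.
std::vector<float> HybridMatVec(const std::vector<float>& input,       // size N
                                const std::vector<int8_t>& q_weights,  // num_units x N, Z2 = 0
                                float weight_scale,                    // S2
                                int num_units) {
  const int N = static_cast<int>(input.size());

  // "Online" symmetric quantization of the input: S1 = max|x| / 127, Z1 = 0.
  float max_abs = 0.0f;
  for (float x : input) max_abs = std::max(max_abs, std::fabs(x));
  const float S1 = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
  std::vector<int8_t> q_input(N);
  for (int j = 0; j < N; ++j) {
    q_input[j] = static_cast<int8_t>(std::lround(input[j] / S1));
  }

  // Combined scale; this mirrors scaling_factors_ptr[b] *= filter->params.scale.
  const float S1S2 = S1 * weight_scale;

  std::vector<float> output(num_units);
  for (int u = 0; u < num_units; ++u) {
    int32_t acc = 0;  // integer accumulation: sum_j q1 * q2
    for (int j = 0; j < N; ++j) {
      acc += static_cast<int32_t>(q_input[j]) *
             static_cast<int32_t>(q_weights[u * N + j]);
    }
    // r3 = S1 * S2 * sum_j q1 * q2 -- float output, S3 never needed.
    output[u] = S1S2 * static_cast<float>(acc);
  }
  return output;
}

The design point is that the output stays in float, so the troublesome S3 never has to be estimated; that is the "hybrid" in EvalHybrid.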