Tags: tensorflow, tensorflow-lite, quantization

Question about an inconsistency between the TensorFlow Lite quantization code, paper and documentation


In the paper Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference, published by Google, the quantization scheme is described as follows:

$$ q_3^{(i,k)} = Z_3 + M \sum_{j=1}^{N} \left(q_1^{(i,j)} - Z_1\right)\left(q_2^{(j,k)} - Z_2\right) $$

where

$$ M := \frac{S_1 S_2}{S_3}, $$

and $S_1$, $S_2$ and $S_3$ are the scales of the two inputs and of the output, respectively.

Both S1 (and the zero point Z1) and S2 (and the zero point Z2) can be determined easily, whether "offline" or "online". But what about S3 (and the zero point Z3)? These parameters depend on the "actual" output (i.e., the float values that would be produced without quantization), and the output is unknown before it is computed.
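
To make the dependence on S3 and Z3 concrete, here is a minimal sketch of the paper's fully-quantized scheme for a single output entry. This is not TFLite's actual kernel: names follow the paper, M is applied as a plain float multiply for readability (the paper realizes it as a fixed-point multiply plus bit shift), and the clamping range assumes int8:

  #include <algorithm>
  #include <cmath>
  #include <cstdint>
  #include <vector>

  // Fully quantized dot product for one output entry, per the paper's
  // scheme: q3 = Z3 + M * sum_j (q1 - Z1) * (q2 - Z2), with M = S1*S2/S3.
  int8_t QuantizedDotProduct(const std::vector<int8_t>& q1,
                             const std::vector<int8_t>& q2,
                             int32_t Z1, int32_t Z2, int32_t Z3, float M) {
    int32_t acc = 0;
    for (size_t j = 0; j < q1.size(); ++j) {
      // Accumulate in 32 bits to avoid overflowing the int8 products.
      acc += (static_cast<int32_t>(q1[j]) - Z1) *
             (static_cast<int32_t>(q2[j]) - Z2);
    }
    // S3 and Z3 are needed right here: M (= S1*S2/S3) and Z3 map the 32-bit
    // accumulator back into the quantized output domain, so they must be
    // known "offline" (e.g., from calibration or quantization-aware training).
    int32_t q3 = Z3 + static_cast<int32_t>(std::lround(M * acc));
    return static_cast<int8_t>(std::max<int32_t>(-128, std::min<int32_t>(127, q3)));
  }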

According to the TensorFlow documentation:

At inference, weights are converted from 8-bits of precision to floating point and computed using floating-point kernels. This conversion is done once and cached to reduce latency.
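
In other words, the documented path would look roughly like the following hypothetical sketch, where the quantized weights are expanded to floats once (weight_scale plays the role of S2, with a symmetric zero point of 0 assumed) and everything afterwards is ordinary float arithmetic:

  #include <cstdint>
  #include <vector>

  // Dequantize int8 weights to float once; the result can be cached and
  // then fed to a regular floating-point matmul kernel.
  std::vector<float> DequantizeWeights(const std::vector<int8_t>& q_weights,
                                       float weight_scale) {
    std::vector<float> weights(q_weights.size());
    for (size_t i = 0; i < q_weights.size(); ++i) {
      weights[i] = weight_scale * static_cast<float>(q_weights[i]);
    }
    return weights;
  }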

But the code below says something different:

  tensor_utils::BatchQuantizeFloats(
      input_ptr, batch_size, input_size, quant_data, scaling_factors_ptr,
      input_offset_ptr, params->asymmetric_quantize_inputs);
  for (int b = 0; b < batch_size; ++b) {
    // Incorporate scaling of the filter.
    scaling_factors_ptr[b] *= filter->params.scale;
  }

  // Compute output += weight * quantized_input
  int32_t* scratch = GetTensorData<int32_t>(accum_scratch);
  tensor_utils::MatrixBatchVectorMultiplyAccumulate(
      filter_data, num_units, input_size, quant_data, scaling_factors_ptr,
      batch_size, GetTensorData<float>(output), /*per_channel_scale=*/nullptr,
      input_offset_ptr, scratch, row_sums_ptr, &data->compute_row_sums,
      CpuBackendContext::GetFromContext(context));

Here we can see:

  scaling_factors_ptr[b] *= filter->params.scale;

I think this means:

  1. S1 * S2 is computed.
  2. The weights are still integers. Just the final results are floats.
  3. It seems S3 and Z3 don't have to be computed. But if so, how can the final float results be close to the unquantized results?

This inconsistency between the paper, the documentation and the code confuses me a lot. I can't tell what I'm missing. Can anyone help me?


Solution

  • Let me answer my own question. All of a sudden I saw what I had missed while riding my bicycle. The code in the question above is from the function tflite::ops::builtin::fully_connected::EvalHybrid(). Here the name explains everything! The value of the output of the matrix multiplication is denoted r3 in section 2.2 of the paper. In terms of equation (2) in section 2.2, we have:

    $$ r_3^{(i,k)} = S_3\left(q_3^{(i,k)} - Z_3\right) $$

    If we want the float result of the matrix multiplication, we can use equation (4) in section 2.2 and then convert the quantized result back to floats, OR we can use equation (3) with the left-hand side replaced by r3, as in:

    $$ r_3^{(i,k)} = \sum_{j=1}^{N} S_1\left(q_1^{(i,j)} - Z_1\right) S_2\left(q_2^{(j,k)} - Z_2\right) $$

    If we choose all the zero points to be 0, then the formula above becomes:

    $$ r_3^{(i,k)} = S_1 S_2 \sum_{j=1}^{N} q_1^{(i,j)}\, q_2^{(j,k)} $$

    And this is just what EvalHybrid() does (ignoring the bias for the moment): the integer accumulation is scaled by S1 * S2 and written out directly as floats, so S3 and Z3 never enter the picture; see the sketch below. It turns out that the paper gives an outline of the quantization algorithm, while the implementation uses different variants.
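
    What EvalHybrid() effectively computes can be sketched as follows for a single output entry, assuming all zero points are 0. This is a simplified illustration, not the real kernel: the real code quantizes per batch row, supports asymmetric inputs, adds the bias, and uses optimized integer kernels.

      #include <algorithm>
      #include <cmath>
      #include <cstdint>
      #include <vector>

      // Hybrid dot product: quantize the float input on the fly (giving S1),
      // accumulate in integers, then one float multiply by S1 * S2 yields the
      // float result r3 = S1 * S2 * sum_j q1 * q2 -- no S3 or Z3 needed.
      float HybridDotProduct(const std::vector<float>& input,       // r1
                             const std::vector<int8_t>& q_weights,  // q2
                             float weight_scale /* S2 */) {
        // Choose S1 "online" from the observed max-abs value of the input.
        float max_abs = 0.f;
        for (float v : input) max_abs = std::max(max_abs, std::fabs(v));
        if (max_abs == 0.f) return 0.f;
        const float S1 = max_abs / 127.f;

        int32_t acc = 0;
        for (size_t j = 0; j < input.size(); ++j) {
          const int32_t q1 = static_cast<int32_t>(std::lround(input[j] / S1));
          acc += q1 * static_cast<int32_t>(q_weights[j]);
        }
        // Cf. "scaling_factors_ptr[b] *= filter->params.scale" in the code
        // above: the combined scale S1 * S2 converts the integer sum to float.
        return S1 * weight_scale * static_cast<float>(acc);
      }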