I am trying the TF-Lite converter with TF 1.12 and found that the accuracy of the TF-Lite model is incorrect after quantization. Take MNIST for example: if I convert to float32 with the following command, the model still predicts correctly when I run convolution_test_lite.py with conv_net_f32.tflite.
*tflite_convert --output_file model_lite/conv_net_f32.tflite \
--graph_def_file frozen_graphs/conv_net.pb \
--input_arrays "input" \
--input_shapes "1,784" \
--output_arrays output \
--output_format TFLITE*
But when I convert with the following command and feed input data in the 0-255 range, the accuracy comes out wrong when I run convolution_test_lite.py with conv_net_uint8.tflite.
*UINT8:
tflite_convert --output_file model_lite/conv_net_uint8.tflite \
--graph_def_file frozen_graphs/conv_net.pb \
--input_arrays "input" \
--input_shapes "1,784" \
--output_arrays output \
--output_format TFLITE \
--mean_values 128 \
--std_dev_values 128 \
--default_ranges_min 0 \
--default_ranges_max 6 \
--inference_type QUANTIZED_UINT8 \
--inference_input_type QUANTIZED_UINT8*
The full test code is uploaded here: https://github.com/mvhsin/TF-lite/blob/master/mnist/convolution_test_lite.py
Does anyone know the reason? Many thanks for the help!
I believe there are multiple issues buried in this. Let me address these one by one.
Your test code (convolution_test_lite.py) is not quantizing the input values correctly. In the case of QUANTIZED_UINT8 quantization:
real_input_value = (quantized_input_value - mean_value) / std_dev_value
This means that to convert your input values in [0, 1] into quantized uint8 values, you need to compute:
quantized_input_value = real_input_value * std_dev_value + mean_value
and apply this to all of your input values.
So, in your convolution_test_lite.py, try changing:
input_data = input_data.astype(np.uint8)
to
# Use the same mean/std values that you pass to the converter
# (0 / 255, as recommended further below)
mean_value = 0
std_dev_value = 255
input_data = input_data * std_dev_value + mean_value
input_data = input_data.astype(np.uint8)
The same applies to the output. You should dequantize the output with:
real_output_value = (quantized_output_value - mean_value) / std_dev_value
That being said, since you're just taking the argmax, the dequantization step is not critical. If you want to see the actual softmax values adding up to 1, you should dequantize the outputs.
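For reference, here is a minimal sketch of the whole round trip (quantize input, invoke, dequantize output) with the TFLite Python interpreter. This assumes TF 1.12, where the interpreter lives under tf.contrib.lite (newer versions use tf.lite.Interpreter); the model path and the random stand-in input are placeholders for your own test data:

import numpy as np
import tensorflow as tf

interpreter = tf.contrib.lite.Interpreter(model_path="model_lite/conv_net_uint8.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Quantize the [0, 1] input with the same mean/std given to the converter
mean_value, std_dev_value = 0, 255
input_data = np.random.rand(1, 784).astype(np.float32)  # stand-in for one MNIST image
quantized_input = (input_data * std_dev_value + mean_value).astype(np.uint8)

interpreter.set_tensor(input_details[0]['index'], quantized_input)
interpreter.invoke()
quantized_output = interpreter.get_tensor(output_details[0]['index'])

# The converted model stores the output scale/zero point; use them to dequantize
scale, zero_point = output_details[0]['quantization']
real_output = scale * (quantized_output.astype(np.float32) - zero_point)
print(np.argmax(real_output))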
Even when you do the input quantization correctly, the model's accuracy will drop significantly. This is because the model was not trained with the quantization-aware training technique (which you linked in your comment). Quantization-aware training captures the real min-max ranges of the intermediate values, which are needed for proper full-integer quantization.
Since the model was not trained with this technique, the best we can do is to provide default min-max ranges via the --default_ranges_min and --default_ranges_max values. This is called dummy quantization, and the model's accuracy is expected to drop significantly.
If you used quantization-aware training, you wouldn't need to provide the default ranges, and the fully quantized model would produce accurate results.
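As an illustration only, here is a rough sketch of how that looks with the TF 1.x tf.contrib.quantize API. The tiny two-layer model is just a stand-in for your conv net, and the quant_delay and learning rate are arbitrary:

import tensorflow as tf

inputs = tf.placeholder(tf.float32, [None, 784], name="input")
labels = tf.placeholder(tf.int64, [None])
hidden = tf.layers.dense(inputs, 128, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, 10, name="output")
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

# Rewrite the graph to insert fake-quant ops that learn the real min/max ranges
tf.contrib.quantize.create_training_graph(quant_delay=2000)
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)

# After training: rebuild the inference graph, call
# tf.contrib.quantize.create_eval_graph(), restore the checkpoint, freeze it,
# and run tflite_convert without the --default_ranges_* flags.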
This is a relatively minor issue, but since the MNIST input value range is [0, 1], it is better to use:
--mean_values 0
--std_dev_values 255
so that the integer value 0 maps to 0.0 and 255 maps to 1.0.
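A quick sanity check of that mapping, using the same dequantization formula as above:

mean_value = 0
std_dev_value = 255.0
for q in (0, 128, 255):
    print(q, "->", (q - mean_value) / std_dev_value)
# 0 -> 0.0, 128 -> ~0.502, 255 -> 1.0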
Post-training quantization only quantizes the weight values, and thus reduces the model size dramatically. Because the inputs/outputs are not quantized in this case, you can basically use the post-training-quantized tflite model as a drop-in replacement for the float32 model.
You can try:
tflite_convert --output_file model_lite/conv_net_post_quant.tflite \
--graph_def_file frozen_graphs/conv_net.pb \
--input_arrays "input" \
--input_shapes "1,784" \
--output_arrays output \
--output_format TFLITE \
--post_training_quantize 1
By providing the --post_training_quantize 1 option, you will see that it produces a much smaller model than the regular float32 version. You can run this model the same way you run the float32 model in convolution_test_lite.py.
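For completeness, a minimal sketch of running it with the interpreter, exactly as you would the float32 model (again assuming TF 1.12's tf.contrib.lite; the path and the random input are placeholders):

import numpy as np
import tensorflow as tf

interpreter = tf.contrib.lite.Interpreter(model_path="model_lite/conv_net_post_quant.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Inputs/outputs stay float32, so no quantization of the input is needed
input_data = np.random.rand(1, 784).astype(np.float32)  # stand-in for one MNIST image
interpreter.set_tensor(inp['index'], input_data)
interpreter.invoke()
print(np.argmax(interpreter.get_tensor(out['index'])))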
Hope this helps.