I'm trying to use UINT8 quantization while converting a TensorFlow model to a TFLite model.
If I set post_training_quantize = True, the model is 4x smaller than the original fp32 model, so I assume the weights are uint8. But when I load the model and check the input type via interpreter_aligner.get_input_details()[0]['dtype'], it is float32. The outputs of the quantized model are about the same as those of the original model.
import tensorflow as tf

converter = tf.contrib.lite.TFLiteConverter.from_frozen_graph(
    graph_def_file='tflite-models/tf_model.pb',
    input_arrays=input_node_names,
    output_arrays=output_node_names)
converter.post_training_quantize = True
tflite_model = converter.convert()
Input/output of converted model:
print(interpreter_aligner.get_input_details())
print(interpreter_aligner.get_output_details())
[{'name': 'input_1_1', 'index': 47, 'shape': array([ 1, 128, 128, 3], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0)}]
[{'name': 'global_average_pooling2d_1_1/Mean', 'index': 45, 'shape': array([ 1, 156], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0)}]
Another option is to specify more parameters explicitly. The model is again 4x smaller than the original fp32 model and the input type is uint8, but the model outputs are more like garbage.
converter = tf.contrib.lite.TFLiteConverter.from_frozen_graph(
    graph_def_file='tflite-models/tf_model.pb',
    input_arrays=input_node_names,
    output_arrays=output_node_names)
converter.post_training_quantize = True
converter.inference_type = tf.contrib.lite.constants.QUANTIZED_UINT8
converter.quantized_input_stats = {input_node_names[0]: (0.0, 255.0)}  # (mean, stddev)
converter.default_ranges_stats = (-100, +100)
tflite_model = converter.convert()
Input/output of converted model:
[{'name': 'input_1_1', 'index': 47, 'shape': array([ 1, 128, 128, 3], dtype=int32), 'dtype': <class 'numpy.uint8'>, 'quantization': (0.003921568859368563, 0)}]
[{'name': 'global_average_pooling2d_1_1/Mean', 'index': 45, 'shape': array([ 1, 156], dtype=int32), 'dtype': <class 'numpy.uint8'>, 'quantization': (0.7843137383460999, 128)}]
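So my questions are:
1. What happens when only post_training_quantize = True is set? I.e., why does the first case work fine but the second doesn't?
2. How do I estimate the mean, std and range parameters for the second case?
3. It looks like model inference is faster in the second case. Does that depend on the model input being uint8?
4. What does 'quantization': (0.0, 0) mean in the first case, and 'quantization': (0.003921568859368563, 0), 'quantization': (0.7843137383460999, 128) in the second case?
5. What is converter.default_ranges_stats?

Update:
The answer to question 4 can be found in: What does 'quantization' mean in interpreter.get_input_details()?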
What is happening when only post_training_quantize = True is set? I.e., why does the 1st case work fine, but the second doesn't?
In TF 1.14, this seems to just quantize the weights stored on disk, in the .tflite file. It does not, by itself, set the inference mode to quantized inference.
That is, you can have a tflite model whose inference type is float32 but whose weights are quantized (using post_training_quantize=True) for the sake of a smaller file on disk and faster loading of the model at runtime.
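The idea of "uint8 weights on disk, float32 at inference" can be illustrated with a small pure-Python sketch. It is not TFLite's actual weight-quantization scheme; the function names `quantize_weights` and `dequantize_weights` and the sample weights are made up for illustration.

```python
from array import array


def quantize_weights(weights):
    """Map float weights to uint8 codes plus the (scale, zero_point) needed to undo it."""
    # Include 0.0 in the range so that it is exactly representable.
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / 255.0
    if scale == 0.0:
        scale = 1.0  # all-zero weights: any scale works
    zero_point = round(-w_min / scale)
    codes = array("B", (min(255, max(0, round(w / scale) + zero_point))
                        for w in weights))
    return codes, scale, zero_point


def dequantize_weights(codes, scale, zero_point):
    """Recover approximate float32 weights, as a float-inference runtime would at load time."""
    return array("f", ((c - zero_point) * scale for c in codes))


weights = array("f", [-0.75, -0.1, 0.0, 0.33, 1.25])  # stand-in for fp32 weights
codes, scale, zero_point = quantize_weights(weights)
restored = dequantize_weights(codes, scale, zero_point)

stored_float_bytes = len(weights) * weights.itemsize
stored_uint8_bytes = len(codes) * codes.itemsize
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(stored_float_bytes, stored_uint8_bytes, max_error)
```

The stored codes take one byte per weight instead of four, while the values used for computation are floats again, which matches a float32 input dtype together with a much smaller file.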
How to estimate mean, std and range parameters for second case?
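The documentation is confusing to many. Let me explain what I concluded after some research:
1. Quantization parameters/stats appear in 3 equivalent forms:
   A) (mean, std_dev)
   B) (zero_point, scale)
   C) (min, max)
2. Conversion from B) to A):
   std_dev = 1.0 / scale
   mean = zero_point
3. Conversion from C) to A):
   mean = 255.0 * min / (min - max)
   std_dev = 255.0 / (max - min)
   Explanation: the stats map the real range (min, max) onto (0, 255), so you can start from the two equations min * std_dev + mean = 0 and max * std_dev + mean = 255, then follow the math to reach the above conversion formulas.
4. Conversion from A) to C):
   min = -mean / std_dev
   max = (255 - mean) / std_dev

To answer your question: if your input image has:
- range (0, 255), then mean = 0, std_dev = 1
- range (-1, 1), then mean = 127.5, std_dev = 127.5
- range (0, 1), then mean = 0, std_dev = 255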
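The conversions between the three forms can be written as a few small helpers. This is a minimal sketch assuming the TF 1.x converter convention real_value = (quantized_value - mean) / std_dev for uint8 tensors; the function names are made up for illustration.

```python
def stats_from_range(real_min, real_max):
    """(min, max) -> (mean, std_dev) for a uint8 tensor."""
    std_dev = 255.0 / (real_max - real_min)
    # Same as 255 * min / (min - max), written so that min = 0 gives 0.0, not -0.0.
    mean = 0.0 - real_min * std_dev
    return mean, std_dev


def range_from_stats(mean, std_dev):
    """(mean, std_dev) -> (min, max): the real values that codes 0 and 255 stand for."""
    real_min = (0.0 - mean) / std_dev
    real_max = (255.0 - mean) / std_dev
    return real_min, real_max


def stats_from_scale(scale, zero_point):
    """(scale, zero_point) -> (mean, std_dev)."""
    return float(zero_point), 1.0 / scale


for real_range in [(0.0, 255.0), (-1.0, 1.0), (0.0, 1.0)]:
    print(real_range, "->", stats_from_range(*real_range))
```

Feeding the result of `stats_from_range` back into `range_from_stats` should return the original range, which is a quick sanity check before passing the values to `quantized_input_stats`.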
It looks like model inference is faster in the second case. Does that depend on the model input being uint8?
Yes, possibly. However, quantized models are typically slower unless they make use of the vectorized instructions of your specific hardware. TFLite is optimized to run those specialized instructions on ARM processors. As of TF 1.14 or 1.15, if you are running this on a local x86 machine (Intel or AMD), I'd be surprised if the quantized model ran faster. [Update: It's on TFLite's roadmap to add first-class support for x86 vectorized instructions to make quantized inference faster than float.]
What does 'quantization': (0.0, 0) mean in the 1st case, and 'quantization': (0.003921568859368563, 0), 'quantization': (0.7843137383460999, 128) in the 2nd case?
Here the format is quantization: (scale, zero_point).
In your first case, you only activated post_training_quantize=True, and this doesn't make the model run quantized inference, so there is no need to transform the inputs or the outputs from float to uint8. Thus the quantization stats here are essentially null, which is represented as (0, 0).
In the second case, you activated quantized inference by providing inference_type = tf.contrib.lite.constants.QUANTIZED_UINT8. So you have quantization parameters for both input and output, which are needed to transform your float input to uint8 on the way in to the model, and the uint8 output to a float output on the way out.
To convert your float input to uint8:
uint8_array = (float_array * std_dev) + mean
To convert your uint8 output to float:
float_array = (uint8_array.astype(np.float32) - mean) / std_dev
Note: if you use scale instead of std_dev (scale = 1.0 / std_dev), the multiplication becomes a division and vice versa.
Another confusing thing here is that, even though during conversion you specify quantized_input_stats = (mean, std_dev), get_output_details will return quantization: (scale, zero_point). Not only is the form different (scale vs. std_dev), the order is different too!
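Written in terms of the (scale, zero_point) pair that the interpreter reports, the two transformations look roughly like this. It is a plain-Python sketch on lists rather than NumPy arrays; `quantize` and `dequantize` are illustrative names, and the rounding and clamping to [0, 255] are assumptions about how you would prepare uint8 data yourself.

```python
def quantize(values, scale, zero_point):
    """float -> uint8 codes: q = round(x / scale) + zero_point, clamped to [0, 255]."""
    return [min(255, max(0, round(x / scale) + zero_point)) for x in values]


def dequantize(codes, scale, zero_point):
    """uint8 codes -> float: x = (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in codes]


# (scale, zero_point) pairs as reported for the second model in the question.
input_scale, input_zero_point = 0.003921568859368563, 0
output_scale, output_zero_point = 0.7843137383460999, 128

pixels = [0.0, 0.5, 1.0]  # float image values in the range (0, 1)
input_codes = quantize(pixels, input_scale, input_zero_point)

raw_outputs = [0, 128, 255]  # example uint8 values coming out of the model
real_outputs = dequantize(raw_outputs, output_scale, output_zero_point)
print(input_codes, real_outputs)
```

Reading both pairs from `get_input_details()` and `get_output_details()` at runtime, rather than hard-coding them, keeps the pre- and post-processing in step with whatever the converter produced.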
Now, to understand the quantization parameter values you got for the input and output, let's use the formulas above to deduce the range of real values (min, max) of your inputs and outputs. Using the above formulas we get:
- Input range: min = 0, max = 1 (it is you who specified this by providing quantized_input_stats = {input_node_names[0]: (0.0, 255.0)} # (mean, stddev))
- Output range: min = -100.39, max = 99.6
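The same deduction can be done directly on the dictionaries printed by the interpreter. The helper below is a sketch with a made-up name, `representable_range`; the entries are trimmed copies of the outputs shown in the question, keeping only the keys the helper needs.

```python
def representable_range(details, qmin=0, qmax=255):
    """Return (real_min, real_max, step) for one entry of get_input_details()
    or get_output_details(), or None if the tensor is not quantized."""
    scale, zero_point = details["quantization"]
    if scale == 0:
        return None  # (0.0, 0) marks a float tensor with no quantization
    real_min = scale * (qmin - zero_point)
    real_max = scale * (qmax - zero_point)
    return real_min, real_max, scale


float_input = {"name": "input_1_1", "quantization": (0.0, 0)}
uint8_input = {"name": "input_1_1", "quantization": (0.003921568859368563, 0)}
uint8_output = {"name": "global_average_pooling2d_1_1/Mean",
                "quantization": (0.7843137383460999, 128)}

for entry in (float_input, uint8_input, uint8_output):
    print(entry["name"], representable_range(entry))
```

The third element of the result is the real-valued step between two neighbouring uint8 codes. For the output tensor it is about 0.78, which may be too coarse if the true outputs span a much narrower range than (-100, 100).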