Search code examples
pythontensorflowdeep-learningtensorflow-litequantization

Understanding tf.contrib.lite.TFLiteConverter quantization parameters


I'm trying to use UINT8 quantization while converting tensorflow model to tflite model:

If use post_training_quantize = True, model size is x4 lower then original fp32 model, so I assume that model weights are uint8, but when I load model and get input type via interpreter_aligner.get_input_details()[0]['dtype'] it's float32. Outputs of the quantized model are about the same as original model.

converter = tf.contrib.lite.TFLiteConverter.from_frozen_graph(
        graph_def_file='tflite-models/tf_model.pb',
        input_arrays=input_node_names,
        output_arrays=output_node_names)
converter.post_training_quantize = True
tflite_model = converter.convert()

Input/output of converted model:

print(interpreter_aligner.get_input_details())
print(interpreter_aligner.get_output_details())
[{'name': 'input_1_1', 'index': 47, 'shape': array([  1, 128, 128,   3], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0)}]
[{'name': 'global_average_pooling2d_1_1/Mean', 'index': 45, 'shape': array([  1, 156], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0)}]

Another option is to specify more parameters explicitly: Model size is x4 lower then original fp32 model, model input type is uint8, but model outputs are more like garbage.

converter = tf.contrib.lite.TFLiteConverter.from_frozen_graph(
        graph_def_file='tflite-models/tf_model.pb',
        input_arrays=input_node_names,
        output_arrays=output_node_names)
converter.post_training_quantize = True
converter.inference_type = tf.contrib.lite.constants.QUANTIZED_UINT8
converter.quantized_input_stats = {input_node_names[0]: (0.0, 255.0)}  # (mean, stddev)
converter.default_ranges_stats = (-100, +100)
tflite_model = converter.convert()

Input/output of converted model:

[{'name': 'input_1_1', 'index': 47, 'shape': array([  1, 128, 128,   3], dtype=int32), 'dtype': <class 'numpy.uint8'>, 'quantization': (0.003921568859368563, 0)}]
[{'name': 'global_average_pooling2d_1_1/Mean', 'index': 45, 'shape': array([  1, 156], dtype=int32), 'dtype': <class 'numpy.uint8'>, 'quantization': (0.7843137383460999, 128)}]

So my questions are:

  1. What is happenning when only post_training_quantize = True is set? i.e. why 1st case work fine, but second don't.
  2. How to estimate mean, std and range parameters for second case?
  3. Looks like in second case model inference is faster, is it depend on the fact that model input is uint8?
  4. What means 'quantization': (0.0, 0) in 1st case and 'quantization': (0.003921568859368563, 0),'quantization': (0.7843137383460999, 128) in 2nd case?
  5. What is converter.default_ranges_stats ?

Update:

Answer to question 4 is found What does 'quantization' mean in interpreter.get_input_details()?


Solution

  • What is happenning when only post_training_quantize = True is set? i.e. why 1st case work fine, but second don't.

    In TF 1.14, this seems to just quantize the weights stored on disk, in the .tflite file. This does not, by itself, set the inference mode to quantized inference.

    i.e., You can have a tflite model which has inference type float32 but the model weights are quantized (using post_training_quantize=True) for the sake of lower disk size, and faster loading of the model at runtime.

    How to estimate mean, std and range parameters for second case?

    The documentation is confusing to many. Let me explain what I concluded after some research :

    1. Unfortunately quantization parameters/stats has 3 equivalent forms/representations across the TF library and documentation :
      • A) (mean, std_dev)
      • B) (zero_point, scale)
      • C) (min,max)
    2. Conversion from B) and A):
      • std_dev = 1.0 / scale
      • mean = zero_point
    3. Conversion from C) to A):
      • mean = 255.0*min / (min - max)
      • std_dev = 255.0 / (max - min)
      • Explanation: quantization stats are parameters used for mapping the range (0,255) to an arbitrary range, you can start from the 2 equations: min / std_dev + mean = 0 and max / std_dev + mean = 255, then follow the math to reach the above conversion formulas
    4. Conversion from A) to C):
      • min = - mean * std_dev
      • max = (255 - mean) * std_dev
    5. The naming "mean" and "std_dev" are confusing and are largely seen as misnomers.

    To answer your question: , if your input image has :

    • range (0,255) then mean = 0, std_dev = 1
    • range (-1,1) then mean = 127.5, std_dev = 127.5
    • range (0,1) then mean = 0, std_dev = 255

    Looks like in second case model inference is faster, is it depend on the fact that model input is uint8?

    Yes, possibly. However quantized models are typically slower unless you make use of vectorized instructions of your specific hardware. TFLite is optimized to run those specialized instruction for ARM processors. As of TF 1.14 or 1.15 if you are running this on your local machine x86 Intel or AMD, then I'd be surprised if the quantized model runs faster. [Update: It's on TFLite's roadmap to add first-class support for x86 vectorized instructions to make quantized inference faster than float]

    What means 'quantization': (0.0, 0) in 1st case and 'quantization': (0.003921568859368563, 0),'quantization': (0.7843137383460999, 128) in 2nd case?

    Here this has the format is quantization: (scale, zero_point)

    In your first case, you only activated post_training_quantize=True, and this doesn't make the model run quantized inference, so there is no need to transform the inputs or the outputs from float to uint8. Thus quantization stats here are essentially null, which is represented as (0,0).

    In the second case, you activated quantized inference by providing inference_type = tf.contrib.lite.constants.QUANTIZED_UINT8. So you have quantization parameters for both input and output, which are needed to transform your float input to uint8 on the way in to the model, and the uint8 output to a float output on the way out.

    • At input, do the transformation: uint8_array = (float_array / std_dev) + mean
    • At output, do the transformation: float_array = (uint8_array.astype(np.float32) - mean) * std_dev
    • Note .astype(float32) this is necessary in python to get correct calculation
    • Note that other texts may use scale instead of std_dev so the divisions will become multiplications and vice versa.

    Another confusing thing here is that, even though during conversion you specify quantization_stats = (mean, std_dev), the get_output_details will return quantization: (scale, zero_point), not just the form is different (scale vs std_dev) but also the order is different!

    Now to understand these quantization parameter values you got for the input and output, let's use the formulas above to deduce the range of real values ((min,max)) of your inputs and outputs. Using the above formulas we get :

    • Input range : min = 0, max=1 (it is you who specified this by providing quantized_input_stats = {input_node_names[0]: (0.0, 255.0)} # (mean, stddev) )
    • Output range: min = -100.39, max=99.6