When converting a TF model to a TFLite model (or, in other words, quantizing a model using "post-training quantization"), the Relu layers disappear from the graph. This is explained in the documentation: "operations that can be simply removed from the graph (tf.identity), replaced by tensors (tf.placeholder), or fused into more complex operations (tf.nn.bias_add)."
My question is: how can a Relu layer be fused into a prior layer? (What is the math behind this "fusion"? Is this procedure specific to quantized models, or can it also be applied to the original floating-point model?)
For Relu (and activation functions in general) in TFLite, the fusion doesn't really have any math behind it; it works because the Conv kernel supports applying the activation while computing the convolution. So, instead of building a tensor of X elements as the Conv output and then passing it as input to a following Relu layer, which just iterates over it again, you can clamp the values directly during convolution. Because the TFLite kernel supports this, we can simplify the graph during conversion: fuse the activation layer into the Conv and set the FusedActivationFunction type in ConvParams to whichever activation should happen during convolution. This is not specific to quantized models; TFLite's float Conv kernels do this as well. Here is an example where the clamping values are set before the GEMM: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/optimized/optimized_ops.h#L1338
Or in the reference kernel: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/reference/conv.h#L91
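To make the "no math, just clamping" point concrete, here is a minimal 1-D sketch in NumPy (illustrative only, not TFLite's actual kernel; the function names and the act_min/act_max parameters are made up to mirror the output_activation_min/max bounds in ConvParams):

```python
import numpy as np

def conv1d_then_relu(x, w):
    # Unfused: materialize the full Conv output, then iterate again for Relu.
    out = np.array([np.dot(x[i:i + len(w)], w) for i in range(len(x) - len(w) + 1)])
    return np.maximum(out, 0.0)

def conv1d_fused_relu(x, w, act_min=0.0, act_max=np.inf):
    # Fused: clamp each accumulator right after it is computed,
    # so the intermediate tensor is never written out and re-read.
    out = np.empty(len(x) - len(w) + 1)
    for i in range(len(out)):
        acc = np.dot(x[i:i + len(w)], w)
        out[i] = min(max(acc, act_min), act_max)  # Relu == clamp to [0, +inf)
    return out

x, w = np.random.randn(16), np.random.randn(3)
assert np.allclose(conv1d_then_relu(x, w), conv1d_fused_relu(x, w))
```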
As for bias_add disappearing: the converter fuses the BiasAdd into the Conv and sets the bias param in the op (in the case where the value to add is constant), so the kernel can add the bias value during the convolution computation: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/reference/conv.h#L89
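Same idea, sketched hypothetically in NumPy: folding a constant BiasAdd into the Conv's bias parameter means the add happens inside the convolution loop instead of as a separate op over the intermediate tensor:

```python
import numpy as np

def conv1d(x, w):
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(len(x) - len(w) + 1)])

def conv1d_with_bias(x, w, b):
    # Bias added to each accumulator inside the Conv kernel itself,
    # analogous to the bias parameter in TFLite's reference conv.
    return np.array([np.dot(x[i:i + len(w)], w) + b for i in range(len(x) - len(w) + 1)])

x, w, b = np.random.randn(16), np.random.randn(3), 0.5
assert np.allclose(conv1d(x, w) + b,            # Conv followed by a separate BiasAdd
                   conv1d_with_bias(x, w, b))   # single fused Conv op
```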
For cases like Mul, the converter fuses the Mul into the Conv if the multiplier is a constant, rewriting

Mul(Const_A, Conv(Input, Filter, bias))

as

Conv(Input, Filter * Const_A, bias * Const_A)

assuming Const_A and Filter have broadcastable shapes. (This is valid because convolution is linear in the filter and bias, so a constant scale commutes through it.)
This happens during conversion: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/mlir/lite/transforms/optimize_patterns.td#L118
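You can check that rewrite numerically; here is a small NumPy sketch (again illustrative, not the converter's code) with a per-output-channel constant, which is the broadcastable case:

```python
import numpy as np

def conv1d(x, w, b):
    # x: [length], w: [kernel_size, out_channels], b: [out_channels]
    return np.array([x[i:i + w.shape[0]] @ w + b
                     for i in range(len(x) - w.shape[0] + 1)])

x = np.random.randn(16)
w = np.random.randn(3, 4)   # 3-tap kernel, 4 output channels
b = np.random.randn(4)
a = np.random.randn(4)      # Const_A: one constant per output channel

unfused = a * conv1d(x, w, b)      # Mul(Const_A, Conv(Input, Filter, bias))
fused   = conv1d(x, w * a, b * a)  # Conv(Input, Filter*Const_A, bias*Const_A)
assert np.allclose(unfused, fused)
```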
Hope that helps.