Tags: math, tensorflow, quantization, 8-bit

How is 8-bit arithmetic done in TensorFlow?


This TensorFlow guide gives some insight into the 8-bit representation of neural network weights and activations. It maps the float32 range from its minimum to its maximum onto the 8-bit range, so that the minimum float32 value maps to 0 and the maximum maps to 255. This means the additive identity (0) is mapped to a non-zero value, and even the multiplicative identity (1) may be mapped to a value other than 1 in the 8-bit representation (see the sketch after the questions below). My questions are:

  1. After losing these identities, how is the arithmetic performed in the new representation? In the case of addition/subtraction, we can get back the approximate float32 number after appropriate scaling and offsetting.

  2. How do we convert the result of a multiplication in the 8-bit format back to the native float32 format?
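
To make the mapping concrete, here is a minimal NumPy sketch of the affine quantize/dequantize scheme described above. It only illustrates the idea, not TensorFlow's actual implementation; the function names and the example range [-3.0, 6.0] are made up for this sketch.

    import numpy as np

    def quantize(x, x_min, x_max, num_bits=8):
        """Map float values in [x_min, x_max] to integer codes in [0, 2**num_bits - 1]."""
        levels = 2 ** num_bits - 1            # 255 for 8 bits
        scale = (x_max - x_min) / levels      # float width of one quantization step
        q = np.round((x - x_min) / scale)
        return np.clip(q, 0, levels).astype(np.uint8), scale

    def dequantize(q, x_min, scale):
        """Recover an approximate float32 value from the quantized codes."""
        return q.astype(np.float32) * scale + x_min

    x = np.array([-3.0, 0.0, 1.0, 6.0], dtype=np.float32)
    q, scale = quantize(x, x_min=-3.0, x_max=6.0)
    print(q)                            # [  0  85 113 255] -- note that float 0.0 maps to code 85
    print(dequantize(q, -3.0, scale))   # approximately [-3.  0.  0.988  6.]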


Solution

  • There are some more details on how the quantization process works in practice here: http://www.oreilly.com/data/free/building-mobile-applications-with-tensorflow.csp

    We'll be updating the tensorflow.org documentation soon too. To specifically answer #2, you have a new min/max float range for your 32-bit accumulated result which you can use to convert back to floats.
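
    To illustrate that answer, here is a hedged NumPy sketch of how a quantized multiply can work: multiply the offset-corrected 8-bit codes in an int32 accumulator, then rescale the accumulator back to float32 using the product of the two input scales. The helper names and example ranges below are invented for this sketch and are not TensorFlow's real kernels.

        import numpy as np

        def quant_params(x_min, x_max, num_bits=8):
            """Scale and zero point for an asymmetric uint8 quantization of [x_min, x_max]."""
            scale = (x_max - x_min) / (2 ** num_bits - 1)
            zero_point = int(round(-x_min / scale))   # uint8 code that represents float 0.0
            return scale, zero_point

        def quantize(x, scale, zero_point):
            q = np.round(x / scale) + zero_point
            return np.clip(q, 0, 255).astype(np.uint8)

        # Two float tensors with their own (arbitrarily chosen) ranges.
        a = np.array([0.5, -1.0, 2.0], dtype=np.float32)
        b = np.array([1.5,  0.25, -0.5], dtype=np.float32)
        sa, za = quant_params(-1.0, 2.0)
        sb, zb = quant_params(-0.5, 1.5)
        qa, qb = quantize(a, sa, za), quantize(b, sb, zb)

        # Multiply in the integer domain, accumulating in int32 so nothing overflows.
        acc = (qa.astype(np.int32) - za) * (qb.astype(np.int32) - zb)

        # The accumulator's effective scale is the product of the input scales, which
        # defines the new float range for the 32-bit result; converting back to
        # float32 is a single rescale.
        result = acc.astype(np.float32) * (sa * sb)
        print(result)   # close to a * b
        print(a * b)    # [ 0.75 -0.25 -1.  ]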