floating-point, neural-network, fixed-point, exponent, mantissa

Reduce mantissa bit-width


Well, I feel embarrassed that I cannot work this out on my own, but...
How can I reduce the mantissa (and exponent) bit-width of a floating-point number?


I am training a (convolutional) artificial neural network (and I'm implementing it on FPGA), and I'd like to study how mantissa (and exponent) bit-width affects testing (and training) accuracy on CPU (and GPU). The next step would be converting my floats into a fixed-point representation (that is what I am using on FPGA) and seeing how it goes.

Similar studies have already been done by others ([Tong, Rutenbar and Nagle (1998)] and [Leeser and Zhao (2003)]), so there should be a way of doing this, although the 'how' is not yet clear to me.

One last point: I'm programming in Lua, but I can easily include C code through LuaJIT's ffi.


Solution

  • To remove s bits from the significand of a binary floating-point number x and round the remaining bits, use Veltkamp’s algorithm:

    Let factor = 2**s + 1.
    Let c = factor * x.
    Let y = c - (c-x).
    

    Each operation above should be computed with floating-point arithmetic, including rounding-to-nearest with the same precision as x. Then y is the desired result.

    Note that this rounds a single number to a shorter significand. It will not generally reproduce the results of computing with shorter significands. E.g., given a and b, computing a*b with greater precision and then rounding to lesser precision will not always give the same result as computing a*b directly in the final precision.

    To decrease the exponent range, you can merely compare a value to thresholds for the new exponent range and declare underflow or overflow as appropriate.
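Since C can be called from LuaJIT, the steps above could be sketched in C roughly as follows. This is only a minimal sketch under some assumptions: x is an IEEE 754 single-precision float, the default round-to-nearest mode is in effect, and the names reduce_significand, check_range and range_status are made up for illustration. The volatile temporaries are one way to make sure each step is really rounded to float and is not simplified or fused by the compiler. What to do on underflow or overflow (flush to zero, saturate, return infinity) is left to the caller.

```c
#include <assert.h>
#include <float.h>

/* Round x to (FLT_MANT_DIG - s) significand bits with Veltkamp's algorithm.
 * The volatile temporaries force every intermediate result to be rounded to
 * float and stop the compiler from simplifying c - (c - x) back to x. */
float reduce_significand(float x, int s)
{
    assert(s >= 0 && s < FLT_MANT_DIG);
    volatile float factor = (float)(1UL << s) + 1.0f; /* 2^s + 1, exact in float */
    volatile float c = factor * x;  /* can overflow when x is close to FLT_MAX */
    volatile float d = c - x;
    volatile float y = c - d;
    return y;
}

/* Hypothetical status codes for the reduced exponent range. */
typedef enum { RANGE_OK, RANGE_UNDERFLOW, RANGE_OVERFLOW } range_status;

/* Compare |x| against the thresholds of the target format:
 * min_normal is its smallest positive normal value,
 * max_finite is its largest finite value. */
range_status check_range(float x, float min_normal, float max_finite)
{
    float mag = x < 0.0f ? -x : x;
    if (mag > max_finite)
        return RANGE_OVERFLOW;
    if (mag != 0.0f && mag < min_normal)
        return RANGE_UNDERFLOW;
    return RANGE_OK; /* zero, NaN and in-range values end up here */
}
```

For example, to mimic IEEE half precision (11 significand bits, normal range 2^-14 to 65504) one might call reduce_significand(x, 13) and then check_range(y, 0x1p-14f, 65504.0f) on the result. The range check is best done after the rounding step, because rounding can push a value up across the overflow threshold.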