floating-point, rounding, quantization

Floating-point quantization from double to 8-bit


How can I round a double-precision floating-point number to the nearest value that can be stored in an 8-bit floating-point format? I'm trying to do it mathematically, but I don't know how.

I have a double x and need to find the nearest y that can be expressed as n*2^b, where n and b are integers and n is in [-128,127]. How can I find the best n and b?


Solution

  • I solved it with this algorithm:

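    function y = DoubleTo8bit( x )
    % Round x to the nearest value n*2^b with integer b and integer n in [-128,127].
    s = sign(x);
    x = abs(x);
    
    if x == 0
        y = 0;
        return;
    end
    % Choose b so that x/2^b lies in [64,128), i.e. the magnitude uses 7 bits.
    % A smaller b for negative inputs would let |n| reach 255, which is
    % outside [-128,127], so the same exponent is used for both signs.
    b = floor(log2(x)) - 6;
    m = s*round(x/2^b);  % |m| may round up to 128; y then equals 64*2^(b+1)
    
    y = m*2^b;
    end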
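
As a hand check of the exponent choice, here is a worked example for a positive input. The value x = 200.7 is only an illustration, not taken from the question. For positive x the exponent in the function reduces to b = floor(log2(x)) - 6, which places x/2^b in [64,128) so that n fits in the positive range up to 127.

```latex
\begin{aligned}
x &= 200.7, \qquad \lfloor \log_2 x \rfloor = 7 \quad (2^7 = 128 \le 200.7 < 256 = 2^8) \\
b &= \lfloor \log_2 x \rfloor - 6 = 1 \\
n &= \operatorname{round}\!\left(x / 2^{b}\right) = \operatorname{round}(100.35) = 100 \\
y &= n \cdot 2^{b} = 200, \qquad |x - y| = 0.7 \le 2^{\,b-1} = 1
\end{aligned}
```

The error is bounded by half the step size 2^b, which is the best that rounding to a multiple of 2^b can achieve.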