Search code examples
c++floating-pointieee-754packing

Packing 32bit floats into 30 bits (c++)


Here are the goals I'm trying to achieve:

  • I need to pack 32 bit IEEE floats into 30 bits.
  • I want to do this by decreasing the size of mantissa by 2 bits.
  • The operation itself should be as fast as possible.
  • I'm aware that some precision will be lost, and this is acceptable.
  • It would be an advantage, if this operation would not ruin special cases like SNaN, QNaN, infinities, etc. But I'm ready to sacrifice this over speed.

I guess this questions consists of two parts:

1) Can I just simply clear the least significant bits of mantissa? I've tried this, and so far it works, but maybe I'm asking for trouble... Something like:

float f;
int packed = (*(int*)&f) & ~3;
// later
f = *(float*)&packed;

2) If there are cases where 1) will fail, then what would be the fastest way to achieve this?

Thanks in advance


Solution

  • I can't select any of the answers as the definite one, because most of them have valid information, but not quite what I was looking for. So I'll just summarize my conclusions.

    The method for conversion I've posted in my question's part 1) is clearly wrong by C++ standard, so other methods to extract float's bits should be used.

    And most important... as far as I understand from reading the responses and other sources about IEEE754 floats, it's ok to drop the least significant bits from mantissa. It will mostly affect only precision, with one exception: sNaN. Since sNaN is represented by exponent set to 255, and mantissa != 0, there can be situation where mantissa would be <= 3, and dropping last two bits would convert sNaN to +/-Infinity. But since sNaN are not generated during floating point operations on CPU, its safe under controlled environment.