First of all, an IEEE 754 half-precision floating-point number uses 16 bits: 1 sign bit, 5 exponent bits, and 10 mantissa bits. For normal numbers the value is (-1)^sign * 2^(exponent-15) * (1 + mantissa/1024).
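To make the bit layout concrete, here is a minimal sketch that decodes a raw 16-bit pattern by hand using only the standard library. The function name `decode_half` is my own and is not part of the half library; the zero/subnormal and infinity/NaN branches follow the usual IEEE 754 rules, which the formula above does not cover.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper (not from half.h): decode an IEEE 754 binary16 bit pattern.
double decode_half(std::uint16_t bits) {
    int sign     = (bits >> 15) & 0x1;   // 1 sign bit
    int exponent = (bits >> 10) & 0x1f;  // 5 exponent bits, bias 15
    int mantissa = bits & 0x3ff;         // 10 stored mantissa bits

    double magnitude;
    if (exponent == 0) {
        // Zero or subnormal: no implicit leading 1, fixed exponent of -14.
        magnitude = std::ldexp(mantissa / 1024.0, -14);
    } else if (exponent == 31) {
        // All-ones exponent: infinity if the mantissa is zero, NaN otherwise.
        magnitude = (mantissa == 0)
            ? std::numeric_limits<double>::infinity()
            : std::numeric_limits<double>::quiet_NaN();
    } else {
        // Normal number: 2^(exponent-15) * (1 + mantissa/1024).
        magnitude = std::ldexp(1.0 + mantissa / 1024.0, exponent - 15);
    }
    return sign ? -magnitude : magnitude;
}
```

Applied to the two bit patterns that show up in the output further down, `decode_half(0x3f63)` gives 1.8466796875 and `decode_half(0x60b0)` gives 600.0.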
I'm trying to run an image detection program using half precision. The original program uses single precision (float). I'm using the half-precision class from http://half.sourceforge.net/. With the half class I can at least run the same program (by using half instead of float, compiling with g++ instead of gcc, and adding many type casts).
I found a case where multiplication seems to give the wrong result.
Here is sample code that shows the problem. (To print a half-precision number I have to cast it to float, and there is no automatic conversion in operations that mix half and int, so I added some casts.)
#include <stdio.h>
#include "half.h"
using half_float::half;
typedef half Dtype;
int main()
{
#if 0 // method 0 : this makes sx 600, which is wrong.
int c = 325;
Dtype w_scale = (Dtype)1.847656;
Dtype sx = Dtype(c*w_scale);
printf("sx = %f\n", (float)sx); // <== shows 600.000 which is wrong.
#else // method 1, which also produces wrong result..
int c = 325;
Dtype w_scale = (Dtype)1.847656;
Dtype sx = (Dtype)((Dtype)c*w_scale);
printf("sx = %f\n", (float)sx);
printf("w_scale specified as 1.847656 was 0x%x\n", *(unsigned short *)&w_scale);
#endif
}
The result looks like this :
w_scale = 0x3f63
sx = 600
sx = 0x60b0
But sx should be 325 * 1.847656 = 600.4882. What could be wrong?
ADD: When I first posted this question, I did not expect the value to be exactly 600.4882, only something close to it. I later found that half precision, which can express only 3 to 4 significant decimal digits, cannot do better here: between 512 and 1024 the representable values are 0.5 apart, w_scale itself is stored as 0x3f63 = 1.8466796875, and 325 * 1.8466796875 = 600.17 ends up as 600.0. Everybody knows floating point has this kind of limitation, but some people will make the same mistake I did and overlook that half precision has only 3 to 4 significant digits, so I think this question is worth a look for future askers. (On Stack Overflow, I think people sometimes treat a question as the same old question when it is actually a slightly different case, and it does no harm to have a couple of similar questions.)
I figured out why. Half precision has 11 significant bits (10 stored plus the implicit leading 1), which is about log10(2^11) ≈ 3.3 decimal digits. I wanted sx to be printed as 600.488 or something close, but near 600 half precision can only represent multiples of 0.5.
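The arithmetic can be checked without the half library. The sketch below is only an illustration under my own assumptions: `half_spacing` and `round_to_half_grid` are hypothetical helper names, they work in double, they round to nearest, and they ignore overflow and subnormals, so they model the half-precision grid rather than reproduce the library's exact rounding behavior.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: gap between adjacent half-precision values near x.
// Valid for finite, nonzero x in the normal range of half precision.
double half_spacing(double x) {
    assert(x != 0.0 && std::isfinite(x));
    int e = std::ilogb(x);            // floor(log2(|x|))
    return std::ldexp(1.0, e - 10);   // 2^(e-10), because 10 mantissa bits are stored
}

// Hypothetical helper: nearest value on the half-precision grid around x.
double round_to_half_grid(double x) {
    double step = half_spacing(x);
    return std::nearbyint(x / step) * step;
}
```

With these, `half_spacing(600.0)` is 0.5. The stored w_scale is 1.8466796875 (0x3f63), so the product is 325 * 1.8466796875 = 600.1708984375, and `round_to_half_grid` maps that to 600.0. The exact product 600.4882 would map to 600.5, so part of the error comes from w_scale already having been rounded when it was stored as a half.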
This calculation is part of the image preprocessing, which does not have to run in the 16-bit precision of our tentative hardware, so I can just use float operations for this stage.
ADD: This anomaly came up during an image dimension calculation, and we have no reason to use 16-bit floats there. Only the image data (pixels or feature map data) should use 16-bit floats. Having written this down, it seems like a good general rule.