I am implementing a converter from IEEE 754 32-bit floating point to S15.16 fixed point in an FPGA. The IEEE-754 standard represents a number with three fields, where s is the sign, exp is the denormalized exponent, and m is the mantissa. All these values separately are represented in fixed point.
Well, the simplest way is to take the IEEE-754 value and multiply it by 2**16. Finally, round it to nearest, which gives less error than truncation.
Problem: I'm doing this in an FPGA device, so I can't do it this way.
Solution: Use the binary representation of the values to perform the conversion via bitwise operations.
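For reference, the multiply-and-round approach can be sketched in Python and used to check any bit-level implementation. This is only a minimal sketch: the helper names `to_float32` and `float_to_fixed_ref` are made up for illustration, and it relies on Python's built-in `round`, which rounds ties to even.

```python
import struct

def to_float32(x):
    """Round a Python float (binary64) to the nearest IEEE-754 binary32 value."""
    return struct.unpack('!f', struct.pack('!f', x))[0]

def float_to_fixed_ref(x, frac_bits=16):
    """Reference conversion: scale by 2**frac_bits, then round to nearest (ties to even)."""
    # Multiplying a binary32 value by a power of two is exact in binary64,
    # so the only rounding step is the final round().
    return round(to_float32(x) * (1 << frac_bits))

print(float_to_fixed_ref(1.1))  # 72090
print(float_to_fixed_ref(2.1))  # 137626
```

The two printed values match the expected results in the tests further down (1.1 and 2.1).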
From this representation, and given that the exponent and mantissa are in fixed point, the conversion amounts to multiplying the mantissa by a power of two. Because powers of two are shifts in fixed point, it is possible to rewrite the expression as (in Verilog notation):
x_fixed = ({1'b1, m[22:8]}) << (exp - 126)
OK, this works perfectly, but not all the time. The problem is: how can I apply round-to-nearest? I have run experiments to see what happens in different ranges, where each range lies between two consecutive powers of 2, that is: 1 <= x < 2, 2 <= x < 4, 4 <= x < 8, and so on for the following powers of two. When the values lie between 1 and 2, I have been able to round without problems by looking at the first 2 bits that were discarded from the mantissa. These bits show that:
if 00: rounding is not necessary
if 01 or 10: add one to the shifted mantissa
if 11: add two to the shifted mantissa
To perform the experiments I have implemented a minimal solution in Python using bitwise operations. The code is:
import struct

# Get the bits of sign, exponent and mantissa
def FLOAT_2_BIN(num):
    bits, = struct.unpack('!I', struct.pack('!f', num))
    N = "{:032b}".format(bits)
    a = N[0]         # sign
    b = N[1:9]       # exponent
    c = "1" + N[9:]  # mantissa with hidden bit
    return {'sign': a, 'exp': b, 'mantissa': c}

# Convert the floating point value to fixed via
# bitwise operations
def FLOAT_2_FIXED(x):
    # Get the IEEE-754 bit representation
    IEEE754 = FLOAT_2_BIN(x)
    # Exponent minus 126: removes the bias (127) and adds 1 because
    # the 16 bits kept below hold the mantissa scaled by 2**15
    shift = int(IEEE754['exp'], 2) - 126
    # Get 16 MSB from mantissa
    MSB_mnts = IEEE754['mantissa'][0:16]
    # Convert value from binary to int
    value = int(MSB_mnts, 2)
    # Get the rounding bits: similar to guard bits???
    rnd_bits = IEEE754['mantissa'][16:18]
    # Shifted value by exponent
    value_shift = value << shift
    # Control to rounding nearest
    # Only works with values from 1 to 2
    if rnd_bits == '00':
        rnd = 0
    elif rnd_bits == '01' or rnd_bits == '10':
        rnd = 1
    else:
        rnd = 2
    return value_shift + rnd
The test with values between 1 and 2 gives the following results:
Test for values from 1 <= x < 2
x  mantissa[0:16]  round(x*2**16)  computed  rnd_bits  error  mantissa[20:24]
---------------- ----------------- ----------------- ----------------- ---------- ------ ----------------
1 1000000000000000 65536 65536 00 0 0000
1.1 1000110011001100 72090 72090 11 0 1101
1.2 1001100110011001 78643 78643 10 0 1010
1.3 1010011001100110 85197 85197 01 0 0110
1.4 1011001100110011 91750 91750 00 0 0011
1.5 1100000000000000 98304 98304 00 0 0000
1.6 1100110011001100 104858 104858 11 0 1101
1.7 1101100110011001 111411 111411 10 0 1010
1.8 1110011001100110 117965 117965 01 0 0110
1.9 1111001100110011 124518 124518 00 0 0011
Obviously, if I take values whose fractional part is a multiple of a power of two, there is no need for rounding:
In this case the values have an increment of 1/32
x  mantissa[0:16]  round(x*2**16)  computed  rnd_bits  error  mantissa[20:24]
---------------- ----------------- ----------------- ----------------- ---------- ------ ----------------
10 1010000000000000 655360 655360 00 0 0000
10.0312 1010000010000000 657408 657408 00 0 0000
10.0625 1010000100000000 659456 659456 00 0 0000
10.0938 1010000110000000 661504 661504 00 0 0000
10.125 1010001000000000 663552 663552 00 0 0000
10.1562 1010001010000000 665600 665600 00 0 0000
10.1875 1010001100000000 667648 667648 00 0 0000
10.2188 1010001110000000 669696 669696 00 0 0000
10.25 1010010000000000 671744 671744 00 0 0000
10.2812 1010010010000000 673792 673792 00 0 0000
10.3125 1010010100000000 675840 675840 00 0 0000
10.3438 1010010110000000 677888 677888 00 0 0000
10.375 1010011000000000 679936 679936 00 0 0000
10.4062 1010011010000000 681984 681984 00 0 0000
10.4375 1010011100000000 684032 684032 00 0 0000
10.4688 1010011110000000 686080 686080 00 0 0000
10.5 1010100000000000 688128 688128 00 0 0000
10.5312 1010100010000000 690176 690176 00 0 0000
10.5625 1010100100000000 692224 692224 00 0 0000
10.5938 1010100110000000 694272 694272 00 0 0000
10.625 1010101000000000 696320 696320 00 0 0000
10.6562 1010101010000000 698368 698368 00 0 0000
10.6875 1010101100000000 700416 700416 00 0 0000
10.7188 1010101110000000 702464 702464 00 0 0000
10.75 1010110000000000 704512 704512 00 0 0000
10.7812 1010110010000000 706560 706560 00 0 0000
10.8125 1010110100000000 708608 708608 00 0 0000
10.8438 1010110110000000 710656 710656 00 0 0000
10.875 1010111000000000 712704 712704 00 0 0000
10.9062 1010111010000000 714752 714752 00 0 0000
10.9375 1010111100000000 716800 716800 00 0 0000
10.9688 1010111110000000 718848 718848 00 0 0000
But if 2 <= x < 4 and the increment is not a multiple of a power of two:
Test for values from 2 <= x < 4. Increment is 0.1
Here, I am not applying the rounding, in order to show how the rounding error
increases with the exponent, e.g. 2**shift - 1, where shift is exponent - 126
x  mantissa[0:16]  round(x*2**16)  computed  rnd_bits  error  mantissa[20:24]
---------------- ----------------- ----------------- ----------------- ---------- ------ ----------------
2 1000000000000000 131072 131072 00 0 0000
2.1 1000011001100110 137626 137624 01 -2 0110
2.2 1000110011001100 144179 144176 11 -3 1101
2.3 1001001100110011 150733 150732 00 -1 0011
2.4 1001100110011001 157286 157284 10 -2 1010
2.5 1010000000000000 163840 163840 00 0 0000
2.6 1010011001100110 170394 170392 01 -2 0110
2.7 1010110011001100 176947 176944 11 -3 1101
2.8 1011001100110011 183501 183500 00 -1 0011
2.9 1011100110011001 190054 190052 10 -2 1010
3 1100000000000000 196608 196608 00 0 0000
3.1 1100011001100110 203162 203160 01 -2 0110
3.2 1100110011001100 209715 209712 11 -3 1101
3.3 1101001100110011 216269 216268 00 -1 0011
3.4 1101100110011001 222822 222820 10 -2 1010
3.5 1110000000000000 229376 229376 00 0 0000
3.6 1110011001100110 235930 235928 01 -2 0110
3.7 1110110011001100 242483 242480 11 -3 1101
3.8 1111001100110011 249037 249036 00 -1 0011
3.9 1111100110011001 255590 255588 10 -2 1010
It is clear that the rounding is not correct, and I have also noticed that the maximum rounding error I observe in fixed point is always 2**shift - 1.
Any ideas or suggestions? I thought the problem might be that I'm not taking the guard bits (GSR) into account, but on the other hand, if that were actually the problem: what happens when the necessary rounding is higher than one, e.g. 2, 3, 4...?
The ISO-C99 code below demonstrates one possible way of doing the conversion. The significand (mantissa) bits of the binary32 argument form the bits of the s15.16 result. The exponent bits tell us whether we need to shift these bits right or left to move the least significant integer bit to bit 16. If a left shift is required, rounding is not needed. If a right shift is required, we need to capture any less significant bits discarded. The most significant discarded bit is the round bit; all others collectively represent the sticky bit. Using the literal definition of the rounding mode, we need to round up if (1) both the round bit and the sticky bit are set, or (2) the round bit is set and the sticky bit is clear (i.e., we have a tie case), but the least significant bit of the intermediate result is odd.
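As a minimal sketch of the round/sticky logic just described, here is a small Python model that is convenient for quick experiments. The function name `float_to_s15_16` is made up for illustration. The sketch assumes finite inputs with magnitude below 2**15, treats zeros and subnormals as 0, and does not handle overflow, infinities, or NaNs.

```python
import struct

def float_to_s15_16(x):
    """Convert a binary32 value to S15.16 using round-to-nearest-even."""
    bits, = struct.unpack('!I', struct.pack('!f', x))
    sign = bits >> 31
    expo = (bits >> 23) & 0xFF
    # Restore the hidden bit for normal numbers; zeros/subnormals become 0
    mant = ((bits & 0x7FFFFF) | 0x800000) if expo else 0
    # The integer bit sits at bit 23 but must end up at bit 16,
    # hence the extra 7 on top of removing the bias
    shift = (expo - 127) - 7
    if shift >= 0:
        res = mant << shift                  # left shift: nothing is discarded
    else:
        n = -shift
        res = mant >> n
        rnd = (mant >> (n - 1)) & 1          # most significant discarded bit
        sticky = (mant & ((1 << (n - 1)) - 1)) != 0  # OR of the remaining ones
        if rnd and (sticky or (res & 1)):    # literal round-to-nearest-even rule
            res += 1
    return -res if sign else res

print(float_to_s15_16(2.1))  # 137626
```

For 2.1 the right shift is by 6 bits, the round bit is 1 and the sticky bit is set, so the truncated value 137625 is incremented to 137626.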
Note that real hardware implementations often deviate from such a literal application of the rounding-mode logic. One common scheme is to first increment the result when the round bit is set. Then, if such an increment occurred, clear the least significant bit of the result if the sticky bit is not set. It is easy to see that this achieves the same effect by enumerating all possible combinations of round bit, sticky bit, and result LSB.
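The enumeration mentioned above can be sketched in a few lines of Python. The function names `rne_literal` and `rne_alternate` are illustrative only, and the starting value `0b1010` is an arbitrary choice whose LSB is then varied.

```python
from itertools import product

def rne_literal(res, rnd, sticky):
    """Literal rule: increment if round is set and (sticky is set or result is odd)."""
    return res + 1 if (rnd and (sticky or (res & 1))) else res

def rne_alternate(res, rnd, sticky):
    """Increment whenever round is set; on a tie (sticky clear), clear the LSB."""
    if rnd:
        res += 1
        if not sticky:
            res &= ~1
    return res

# Enumerate every combination of result LSB, round bit, and sticky bit
for lsb, rnd, sticky in product((0, 1), repeat=3):
    res = 0b1010 | lsb
    assert rne_literal(res, rnd, sticky) == rne_alternate(res, rnd, sticky)
    print(lsb, rnd, sticky, rne_literal(res, rnd, sticky))
```

The assertion holding for all eight combinations is what shows the two schemes agree; the printed rows are just there to inspect the individual cases.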
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

/* 1: literal definition of round to nearest or even, 0: alternate scheme */
#define USE_LITERAL_RND_DEF  (1)

/* reinterpret the bits of a binary32 value as an unsigned integer */
uint32_t float_as_uint32 (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

#define FP32_MANT_FRAC_BITS  (23)
#define FP32_EXPO_BITS       (8)
#define FP32_EXPO_MASK       ((1u << FP32_EXPO_BITS) - 1)
#define FP32_MANT_MASK       ((1u << FP32_MANT_FRAC_BITS) - 1)
#define FP32_MANT_INT_BIT    (1u << FP32_MANT_FRAC_BITS)
#define FP32_SIGN_BIT        (1u << (FP32_MANT_FRAC_BITS + FP32_EXPO_BITS))
#define FP32_EXPO_BIAS       (127)
#define FX15P16_FRAC_BITS    (16)
/* right shift needed when the unbiased exponent is 0: 23 - 16 = 7 */
#define FRAC_BITS_DIFF       (FP32_MANT_FRAC_BITS - FX15P16_FRAC_BITS)

int32_t fp32_to_fixed (float a)
{
    /* split binary32 operand into constituent parts */
    uint32_t ia = float_as_uint32 (a);
    uint32_t expo = (ia >> FP32_MANT_FRAC_BITS) & FP32_EXPO_MASK;
    uint32_t mant = expo ? ((ia & FP32_MANT_MASK) | FP32_MANT_INT_BIT) : 0;
    int32_t sign = ia & FP32_SIGN_BIT;
    /* compute and clamp shift count */
    int32_t shift = ((int32_t)expo - FP32_EXPO_BIAS) - FRAC_BITS_DIFF;
    shift = (shift < (-31)) ? (-31) : shift;
    shift = (shift > ( 31)) ? ( 31) : shift;
    /* shift left or right so least significant integer bit becomes bit 16;
       each shift is guarded so its count stays within 0..31 */
    uint32_t shifted_right = (shift < 0) ? (mant >> (-shift)) : mant;
    uint32_t shifted_left  = (shift < 0) ? mant : (mant << shift);
    /* capture discarded bits if right shift */
    uint32_t discard = (shift < 0) ? (mant << (32 + shift)) : 0;
    /* round to nearest or even if right shift */
    uint32_t round  = (discard & 0x80000000) ? 1 : 0;
    uint32_t sticky = (discard & 0x7fffffff) ? 1 : 0;
#if USE_LITERAL_RND_DEF
    uint32_t odd = shifted_right & 1;
    shifted_right = (round & (sticky | odd)) ? (shifted_right + 1) : shifted_right;
#else // USE_LITERAL_RND_DEF
    shifted_right = (round) ? (shifted_right + 1) : shifted_right;
    shifted_right = (round & ~sticky) ? (shifted_right & ~1) : shifted_right;
#endif // USE_LITERAL_RND_DEF
    /* make final selection between left shifted and right shifted */
    int32_t res = (shift < 0) ? shifted_right : shifted_left;
    /* negate if negative */
    return (sign < 0) ? (-res) : res;
}

int main (void)
{
    int32_t res, ref;
    float x;

    printf ("IEEE-754 binary32 to S15.16 fixed-point conversion in RNE mode\n");
    printf ("use %s implementation of round to nearest or even\n",
            USE_LITERAL_RND_DEF ? "literal" : "alternate");

    /* test positive half-plane */
    x = 0.0f;
    while (x < 0x1.0p15f) {
        ref = (int32_t) rint ((double)x * 65536);
        res = fp32_to_fixed (x);
        if (res != ref) {
            printf ("error @ x = % 14.6a: res=%08x ref=%08x\n", x, res, ref);
            printf ("Test FAILED\n");
            return EXIT_FAILURE;
        }
        x = nextafterf (x, INFINITY);
    }

    /* test negative half-plane */
    x = -1.0f * 0.0f;
    while (x >= -0x1.0p15f) {
        ref = (int32_t) rint ((double)x * 65536);
        res = fp32_to_fixed (x);
        if (res != ref) {
            printf ("error @ x = % 14.6a: res=%08x ref=%08x\n", x, res, ref);
            printf ("Test FAILED\n");
            return EXIT_FAILURE;
        }
        x = nextafterf (x, -INFINITY);
    }

    printf ("Test PASSED\n");
    return EXIT_SUCCESS;
}