Search code examples
algorithmbinarynumbersieee-754

IEEE754 single precision - General algorithm for representing the half of a number


Suppose N is an arbitrary number represented according to IEEE754 single precision standards. I want to find the most precise possible representation of N/2 again in IEEE754.

I want to find a general algorithm (described in words, I just want the necessary steps and cases to take into account) for obtaining the representation.

My approach is :

Say the number is represented: b0_b1_b2_b3...b_34.

  1. Isolate the first bit which determines the sign (-/+) of the number.
  2. Calculate the representation of the power (p) from the unsigned representation b_1...b_11.
  3. If power = 128 we have a special case. If all the bits of the mantissa are equal to 0, we have, depending on b_0, either minus or plus infinity. We don't change anything. If the mantissa has at least one bit equal to 1 then we have NaN value. Again we change nothing.
  4. if e is inside]-126, 127[then we have a normalized mantissam. The new power p can be calculated asp' = p - 1and belongs in the interval]-127, 126]. We then calculatem/2` and we represent it starting from the right and losing any bits that cannot be included in the 23 bits of the mantissa.
  5. If e = -126, then in calculating the half of this number we pass in a denormalized mantissa. We represent p = 127, calculate the half of the mantissa and represent it again starting from the right losing any information that cannot be included.
  6. Finally if e = -127 we have a denormalized mantissa. As long as m/2 can be represented in the number of bits available in the mantissa without losing information we represent that and keep p = -127. In any other case we represent the number as a positive or negative 0 depending on b_0

Any steps I have missed, any improvements ( I am sure there are ) that can be made or anything that seems completely wrong?


Solution

  • I implemented a divide by two algorithm in Java and verified it for all 32-bit inputs. I tried to follow your pseudocode, but there were three places where I diverged. First, the infinity/NaN exponent is 128. Second, in case 4 (normal -> normal), there's no need to operate on the fraction. Third, you didn't describe how round half to even works when you do operate on the fraction. LGTM otherwise.

    public final class FloatDivision {
      public static float divideFloatByTwo(float value) {
        int bits = Float.floatToIntBits(value);
        int sign = bits >>> 31;
        int biased_exponent = (bits >>> 23) & 0xff;
        int exponent = biased_exponent - 127;
        int fraction = bits & 0x7fffff;
        if (exponent == 128) {
          // value is NaN or infinity
        } else if (exponent == -126) {
          // value is normal, but result is subnormal
          biased_exponent = 0;
          fraction = divideNonNegativeIntByTwo(0x800000 | fraction);
        } else if (exponent == -127) {
          // value is subnormal or zero
          fraction = divideNonNegativeIntByTwo(fraction);
        } else {
          // value and result are normal
          biased_exponent--;
        }
        return Float.intBitsToFloat((sign << 31) | (biased_exponent << 23) | fraction);
      }
    
      private static int divideNonNegativeIntByTwo(int value) {
        // round half to even
        return (value >>> 1) + ((value >>> 1) & value & 1);
      }
    
      public static void main(String[] args) {
        int bits = Integer.MIN_VALUE;
        do {
          if (bits % 0x800000 == 0) {
            System.out.println(bits);
          }
          float value = Float.intBitsToFloat(bits);
          if (Float.floatToIntBits(divideFloatByTwo(value)) != Float.floatToIntBits(value / 2)) {
            System.err.println(bits);
            break;
          }
        } while (++bits != Integer.MIN_VALUE);
      }
    }