Search code examples
floating-pointprecisionieee-754integer-overflowfloating-point-conversion

How could you divide an IEEE754 64-bit double by 1000 on a platform only supporting 32-bit float?


I've got an electricity meter connected to a DCS (distributed control system) by PROFIBUS. The meter (Siemens Sentron PAC3200) supplies its count as an IEEE 754 double in Wh (watt-hours). Also, the counter overflows at 1.0e12 Wh or 1,000 GWh. (Cutaway scene: Several years earlier, Siemens development labs. "Let's see, how to transfer a 40-bit unsigned integer value? Let's use double!")

My goal is to log the count consistently in kWh precision.

The DCS however only supports single precision floats. So if I took the direct route, i.e. squeezed the data into a float, then at about seven decimal digits errors would appear in the kWh reading, i.e. at the latest from about 100,000,000 Wh or 100 MWh. The current count is 600 MWh already, so this is no feasible way.

So for now, I put the mantissa into an unsigned double integer (UDINT, 32 bits on this platform) and perform the conversion according to IEEE 754, which yields the correct value in Wh. This however entails an overflow at 2^32 Wh or about 4.3 GWh, which will last us barely ten years.

Since I need only kWh precision, I had the idea of dividing by 1000 early in the conversion. This would put the variable overflow at 4,300 GWh, and the meter's internal counter already overflows at 1,000 GWh. Problem solved, in theory.

As IEEE 754 is a binary floating point format however, I can only easily divide by 1024 (right shifting 10 times), which introduces a substantial error. Multiplying with a correction factor of 1.024 afterwards would only ever happen in single precision on this platform, nullifying the previous effort.

Another option would be to output a "high" and "low" UDINT in Wh from the conversion, then I could at least in theory calculate back to kWh, but this seems awkward (and -ful).

I'm having the subtle feeling I may have overlooked something (single-person Groupthink so to speak); I'm open for any other ideas how I could obtain the 1/1000th of the transferred double value.

Thanks and best regards

Björn

P.S.: For your viewing pleasure, this is the solution based on @EricPostpischil's answer -- tailored to platform and task specifics. The language used is SCL (structured control language) as per EN 61131-3, which is kind of a Pascal dialect.

FUNCTION_BLOCK PAC3200KON_P

VAR_INPUT
    INH : DWORD;
    INL : DWORD;
END_VAR

VAR_OUTPUT
    OUT : UDINT;
    SGN : BOOL;
END_VAR

VAR
    significand:              UDINT;
    exponent, i, shift:       INT;
    sign:                     BOOL;
    d0, d1, y0, y1, r1, temp: DWORD;
END_VAR
(*
    Convert the energy count delivered by Siemens Sentron PAC3200
    (IEEE 754 binary64 format, a.k.a. double) into an UDINT.

    Peculiarities:
    - This hardware platform only supports binary32 (a.k.a. float).

    - The Sentron's internal counter overflows at 1.0e12 Wh (1000 GWh).

    - kWh resolution suffices.

    - If you converted the double directly to UDINT and divided by 1000
      afterwards, the range would be reduced to (2^32-1)/1000 GWh or about
      4.295 GWh.

    - This is why this function first divides the significand by 1000
      and then proceeds with conversion to UDINT. This expands the
      range to (2^32-1) GWh or about 4295 GWh, which isn't reachable in
      practice since the device's internal counter overflows before.

    Background:

    IEEE 754 binary64 bit assignment:

               High-Byte                         Low-Byte
    66665555555555444444444433333333 3322222222221111111111
    32109876543210987654321098765432 10987654321098765432109876543210
    GEEEEEEEEEEESSSSSSSSSSSSSSSSSSSS SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS

    G: sign (1: negative)
    E: exponent (biased; subtract 1023) (11 bits)
    S: significand (52 bits)
*)

(*
    significand: Bits 19...0 of high byte und complete low byte

    The significand is initially divided by 1000 using integer division. The
    bits are divided into two parts:

    - d1 contains the 31 most significant bits (plus leading 1)
    - d0 contains the next less significant bits

    In total, we use 48 bits of the original significand.
*)

(* d1: insert significand bits from high byte *)
d1 := INH AND     2#0000_0000_0000_1111_1111_1111_1111_1111;
(* result:        2#0000_0000_0000_HHHH_HHHH_HHHH_HHHH_HHHH *)

(* add the 1 before the binary point *)
d1 := d1 OR       2#0000_0000_0001_0000_0000_0000_0000_0000;
(* result:        2#0000_0000_0001_HHHH_HHHH_HHHH_HHHH_HHHH *)

(* "flush left" shift 11 places *)
d1 := d1 * 2048;
(* result:        2#1HHH_HHHH_HHHH_HHHH_HHHH_H000_0000_0000 *)

(* Insert another 11 bits from low byte (msb ones) *)
d1 := d1 OR (INL / 2097152);
(* result:        2#1HHH_HHHH_HHHH_HHHH_HHHH_HLLL_LLLL_LLLL *)

(* Base-65536 division. Integer divide by 1000 and save remainder *)
y1 := d1 / 1000;
r1 := TO_DW(TO_UD(d1) MOD 1000);

(*
   The significand now has leading zeroes. Shift left to make space
   at the other end.
*)
FOR shift := 1 TO 31 BY 1 DO
    y1 := y1 * 2;
    IF (y1 AND 2#1000_0000_0000_0000_0000_0000_0000_0000) <> 0 THEN
        EXIT;
    END_IF;
END_FOR;

(*
   d0: insert next 16 bits from the low byte
   (right shift five times and zero out the leading places)
*)
(* bits:             2#xxxx_xxxx_xxxL_LLLL_LLLL_LLLL_LLLx_xxxx *)
d0 := (INL / 32) AND 2#0000_0000_0000_0000_1111_1111_1111_1111;
(* result:           2#0000_0000_0000_0000_LLLL_LLLL_LLLL_LLLL *)

(* Now divide by 1000, factoring in remainder from before *)
y0 := ((r1 * 65536) OR d0) / 1000;

(*
   y1 and y0 contain results from division by 1000. We'll now build a 32 bit
   significand from these.

   y1 = 2#1HHH_HHHH_HHHH_HHHH_HHHH_HHxx_xxxx_xxxx
   y0 = 2#0000_0000_0000_0000_LLLL_LLLL_LLLL_LLLL

   y1 has an uncertain number of zeroes at its end, resulting from the above
   left shifting (number of steps inside variable "shift"). Fill those with the
   most significant bits from y0.

   y0 has 16 valid bits (0..15). Shift right so that the "highest place zero"
   in y1 corresponds with the MSB from y0. (shift by 16-shift)

   y1 = 2#1HHH_HHHH_HHHH_HHHH_HHHH_HHxx_xxxx_xxxx (ex.: shift=10)
   y0 = 2#0000_0000_0000_0000_0000_00LL_LLLL_LLLL
                              ------>^
*)

FOR i := 1 TO 16 - shift BY 1 DO
    y0 := y0 / 2;
END_FOR;

significand := TO_UD(y1 OR y0);
(* Result: 32-bit significand *)

(*
    Exponent: bits (62-32)...(59-32) or bits 30...20 of high byte, respectively

    Coded with bias of 1023 (needs to be subtracted).

    Special cases as per standard:
    - 16#000: signed zero or underflow (map to zero)
    - 16#7FF: inifinite or NaN (map to overflow)
*)
temp := 2#0111_1111_1111_0000_0000_0000_0000_0000 AND INH;
temp := temp / 1048576 ; (* right shift 20 places (2^20) *)
exponent := TO_IN(TO_DI(temp));
exponent := exponent - 1023; (* remove bias *)

(*
   Above, we already left shifted "shift" times, which needs to be taken into
   account here by shifting less.
*)
exponent := exponent - shift;

(*
    The significand will be output as UDINT, but was initially a binary64 with
    binary point behind the leading 1, after which the coded exponent must be
    "executed".

    temp = 2#1.HHH_HHHH_HHHH_HHHH_HHHH_HLLL_LLLL_LLLL

    As UDINT, this already corresponds to a 31-fold left shift.

    Exponent cases as per IEEE 754:

    - exponent < 0:            result < 1
    - exponent = 0:       1 <= result < 2
    - exponent = x > 0: 2^x <= result < 2^(x+1)

    The UDINT output (32 bit) allows us to represent exponents right up to 31.
    Everything above is mapped to UDINT's maximum value.

    Now determine, after the de facto 31-fold left shift, what shifts remain
    "to do".
*)

IF exponent < 0 THEN
    (* underflow: < 2^0 *)
    significand := 0;
ELSIF exponent > 31 THEN
    (* overflow: > 2^32 - 1 *)
    significand := 4294967295;
ELSE
    (*
        result is significand * 2^exponent or here, as mentioned above,
        significand * 2^(31-exponent).

        The loop index i is the "shift target" after loop execution, which is
        why it starts at 31-1.

        Example: exponent = 27, but de facto we've already got a shift of 31.
        So we'll shift back four times to place the binary point at the right
        position (30, 29, 28, 27):

        before: temp = 2#1HHH_HHHH_HHHH_HHHH_HHHH_HLLL_LLLL_LLLL.

        after:  temp = 2#1HHH_HHHH_HHHH_HHHH_HHHH_HLLL_LLLL.LLLL
                                                           ^<---|
    *)
    FOR i := 30 TO exponent BY -1 DO
        significand := significand / 2;
    END_FOR;
END_IF;

(*
    sign: bit 63 of high byte
*)
sign := (2#1000_0000_0000_0000_0000_0000_0000_0000 AND INH) <> 0;

OUT := significand;
SGN := sign;

END_FUNCTION_BLOCK

The test data I used:

  high byte     low byte  decimal value
=======================================
16#41c558c3, 16#2d3f331e,       716_277
16#41EFFFFF, 16#5E000000,     4_294_966
16#41EFFFFF, 16#DB000000,     4_294_967
16#41F00000, 16#2C000000,     4_294_968
16#426D1A94, 16#A1830000,   999_999_999
16#426D1A94, 16#A2000000, 1_000_000_000
16#426D1A94, 16#A27D0000, 1_000_000_001
16#428F3FFF, 16#FFC18000, 4_294_967_294
16#428F3FFF, 16#FFE0C000, 4_294_967_295
16#428F4000, 16#00000000, 4_294_967_296

BTW, integer literals of the form b#1234 in SCL basically mean "the number 1234 in base b". Underscores are ignored (they're digit separators for improved readability like e.g. Python has them).


Solution

  • /*  This program shows two methods of dividing an integer exceeding 32 bits
        by 1000 using unsigned 32-bit integer arithmetic.
    */
    
    
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    
    
    /*  If the count is less than 2**35, we can shift three bits (divide by 8) and
        then divide by 125 using 32-bit unsigned arithmetic.
    */
    static uint32_t ShiftThenDivide(uint64_t x)
    {
        uint32_t y = x >> 3;
        return y / 125;
    }
    
    
    /*  Given any count less than 1000*2**32 (which exceeds the 2**40 requirement),
        we can perform long division in radix 65536.
    */
    static uint64_t LongDivision(uint64_t x)
    {
        /*  Set d1 to the high two base-65536 digits (bits 17 to 31) and d0 to
            the low digit (bits 0 to 15).
        */
        uint32_t d1 = x >> 16, d0 = x & 0xffffu;
    
        //  Get the quotient and remainder of dividing d1 by 1000.
        uint32_t y1 = d1 / 1000, r1 = d1 % 1000;
    
        /*  Combine the previous remainder with the low digit of the dividend and
            divide by 1000.
        */
        uint32_t y0 = (r1<<16 | d0) / 1000;
    
        //  Return a quotient formed from the two quotient digits.
        return y1 << 16 | y0;
    }
    
    
    static void Test(uint64_t x)
    {
        //  Use 64-bit arithmetic to get a reference result.
        uint32_t y0 = x / 1000;
    
        //  ShiftThenDivide only works up to 2**35, so only test up to that.
        if (x < UINT64_C(1) << 35)
        {
            uint32_t y1 = ShiftThenDivide(x);
            if (y1 != y0)
            {
                printf("Error, 0x%" PRIx64 " / 1000 = 0x%" PRIx32 ", but ShiftThenDivide produces 0x%" PRIx32 ".\n",
                    x, y0, y1);
                exit(EXIT_FAILURE);
            }
        }
    
        //  Test LongDivision.
        uint32_t y2 = LongDivision(x);
        if (y2 != y0)
        {
            printf("Error, 0x%" PRIx64 " / 1000 = 0x%" PRIx32 ", but LongDivision produces 0x%" PRIx32 ".\n",
                x, y0, y2);
            exit(EXIT_FAILURE);
        }
    }
    
    
    int main(void)
    {
        srandom(time(0));
    
        //  Test all possible values for the upper eight bits.
        for (uint64_t upper = 0; upper < 1<<8; ++upper)
        {
            //  Test some edge cases.
            uint64_t x = upper << 32;
            Test(x);
            Test(x+1);
            Test(x-1 & 0xffffffffffu);
                /*  When x is zero, x-1 would wrap modulo 2**64, but that is
                    outside our supported domain, so wrap modulo 2**40.
                */
    
            //  Test an assortment of low 32 bits.
            for (int i = 0; i < 1000; ++i)
            {
                uint32_t r0 = random() & 0xffffu, r1 = random() & 0xffffu;
                uint64_t lower = r1 << 16 | r0;
                Test(x | lower);
            }
        }
    }