Tags: c, algorithm, bit-manipulation, micro-optimization, bcd

Fast BCD addition


This question was inspired by a question I recently came across on Stack Overflow. It reminded me that early in the development of the x86-64 ISA I had written 32-bit x86 code for BCD addition without using the BCD support instructions available in the x86 ISA. This refers to what, based on comments, may be known on some platforms as packed BCD, that is, one decimal digit stored per tetrade of bits, commonly referred to as a nibble.

While it was pretty obvious that removal of the DAA and DAS instructions (decimal adjust after addition / subtraction) was safe from a performance perspective, it was best to have concrete proof in hand of how one might go about implementing, say, BCD addition without these instructions on an Athlon processor of the day. My code from back then, dug up from the depths of the internet, slightly adjusted and with a bug fix applied, is included below. Note that RCR was not the big performance killer then that it is now. I revisited the issue from today's perspective, choosing to stay with processing in 32-bit chunks for ease of exposition across architectures. How to extend to processing of 64-bit chunks on 64-bit ISAs is obvious.

Conceptually, BCD addition is straightforward. In order to facilitate inter-nibble carry propagation, one temporarily adds 6 to every nibble of the binary sum of the two source operands. For those nibbles that do not produce a carry-out, one either subtracts 6 from the nibble (undo method) or one repeats the original binary addition, but without adding 6 to that nibble (redo method). When working on my original code in 1999 I found that use of the redo method leads to slightly higher performance, possibly due to a shorter dependency chain. The main overhead in BCD addition is determining where inter-nibble carries occur. My approach was to examine the LSB of the next higher nibble: if its value differs from the preliminary value of the sum bit (the XOR of the two source LSBs), a carry-in must have occurred, so the next lower nibble produced a carry-out. A somewhat ugly special case is the handling of a carry-out from the most significant nibble.
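
To make this concrete, here is a minimal sketch (my illustration, not part of the program below) of the redo method with LSB-based carry detection, restricted to a single byte holding two packed BCD digits. Because the byte-sized computation fits comfortably into a uint32_t, the carry-out of the most significant nibble shows up naturally in bit 8, so the special case mentioned above does not arise here; it does in the full 32-bit code.

#include <stdio.h>
#include <stdint.h>

/* two-digit packed BCD addition: redo method, LSB-based carry detection */
static uint32_t bcd2_add (uint32_t x, uint32_t y)
{
    uint32_t xe = x + 0x66;          // add 6 to each nibble to force decimal carries
    uint32_t u  = xe + y;            // adjusted binary sum
    uint32_t s  = (xe ^ y) ^ u;      // carry-in to every bit position
    uint32_t t  = (s >> 1) & 0x88;   // LSB of next higher nibble flags a nibble carry-out
    return (x + y) + (t - (t >> 2)); // redo: raw binary sum, +6 only where a carry occurred
}

int main (void)
{
    printf ("%x\n", bcd2_add (0x27, 0x38)); // prints 65
    printf ("%x\n", bcd2_add (0x95, 0x27)); // prints 122 (carry-out of the top digit in bit 8)
    return 0;
}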

When Knuth published volume 4a of TAOCP in 2011, I found that it contained an algorithm for BCD addition that uses the undo method and detects inter-nibble carries by inspecting the MSBs of corresponding nibbles in the source operands and the adjusted sum to check whether there is a carry-out from that nibble. This makes for exceedingly elegant code when the latter functionality is expressed via the majority-of-3 computation, which Knuth prefers to refer to by the more generic name median.
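
For comparison, an undo-method variant built on that median-based carry detection might look like the sketch below (my reconstruction, not Knuth's code verbatim); the USE_KNUTH path in the program further down instead combines Knuth's carry detection with the redo step. The carry-out of the most significant digit is available as t >> 31.

/* majority-of-3, i.e. what Knuth calls the median */
static uint32_t maj3 (uint32_t x, uint32_t y, uint32_t z)
{
    return (x | (y & z)) & (y | z);
}

/* 8-digit packed BCD addition: undo method, MSB/median carry detection */
static uint32_t bcd32_add_undo (uint32_t x, uint32_t y)
{
    uint32_t z = y + 0x66666666;                // add 6 to every nibble of y
    uint32_t u = x + z;                         // adjusted binary sum
    uint32_t t = maj3 (x, z, ~u) & 0x88888888;  // nibble MSBs mark carry-outs
    uint32_t s = t ^ 0x88888888;                // nibbles that did NOT carry out
    return u - (s - (s >> 2));                  // undo: subtract 6 from those nibbles
}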

What I have found experimentally is that the redo method is generally superior to the undo method across a wide swath of architectures: x86, ARM, RISC-V, SPARC, Power. As for the detection of nibble carries, the LSB method and the MSB method generally seem to be neck and neck in terms of performance, but Knuth's MSB method is superior on architectures that can implement the median operation more efficiently by taking advantage of additional logic operations (such as ORN, ANDN, XNOR, NOR, NAND) beyond the minimal set of logic operations (OR, AND, XOR, NOT). This is especially true on modern NVIDIA GPUs, which provide a LOP3 instruction that can compute any logical operation of three inputs. This maps a median operation, with any of the inputs possibly negated, to a single instance of LOP3.
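
As an aside, the LOP3 mapping is easy to see by computing the 8-bit lookup-table immediate that the instruction takes: the table is obtained by evaluating the desired function on the constants 0xF0, 0xCC, and 0xAA. A small C sketch (my illustration):

#include <stdio.h>
#include <stdint.h>

/* evaluate a 3-input function on 0xF0, 0xCC, 0xAA to obtain a LOP3 lookup table */
static uint8_t lop3_lut_median (uint8_t a, uint8_t b, uint8_t c)
{
    return (a | (b & c)) & (b | c);  // median / majority-of-3
}

int main (void)
{
    printf ("immLut for median(x,z,u)  = 0x%02x\n", lop3_lut_median (0xF0, 0xCC, 0xAA));           // 0xe8
    printf ("immLut for median(x,z,~u) = 0x%02x\n", lop3_lut_median (0xF0, 0xCC, (uint8_t)~0xAA)); // 0xd4
    return 0;
}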

While I consider my bit-twiddling skills to be quite solid, past experience has shown that I am often just at the 85% mark compared to the best solution proposed in answers here. Specifically, I want to find out: are there more efficient ways to deal with the nibble-carry issue? Use of a simple ISA like 32-bit RISC-V may provide the best insights into that. Any superior solution applicable to that ISA is unlikely to hurt on ISAs with a wider variety of available instructions. Beyond that, any clever idea that can reduce the overhead of nibble-carry detection, as evidenced by a reduced instruction count or a shortened dependency chain, is welcome.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>

#define NUM_TESTS    (100000000)

#define TEST_PORTABLE_BCD64 (1) // 0: x86-specific test of 18-digit BCD arithm.
                                // 1: portable test of 16-digit BCD arithmetic
#define PORTABLE     (1)  // 0: x86-specific inline assembly
#define NATIVE_64BIT (0)  // applies only if PORTABLE==1
#define USE_KNUTH    (1)  // applies only if PORTABLE==1 && NATIVE_64BIT==0

#if PORTABLE && !NATIVE_64BIT
/* majority-of-3 operation that Knuth refers to more generically as median */
uint32_t median32 (uint32_t x, uint32_t y, uint32_t z)
{
    return (x | (y & z)) & (y | z);
}

/* 8-digit BCD addition with carry-out */
uint32_t BCD32_ADDcc (uint32_t x, uint32_t y, uint32_t *cy)
{
#if USE_KNUTH
    // derived from Knuth, TAOCP 4a (2011), answer to exercise 100 section 7.1.3
    uint32_t z, u, t;
    z = y + 0x66666666;
    u = x + z;
    t = median32 (x, z, ~u) & 0x88888888;
    *cy = t >> 31;
    return (x + y) + (t - (t >> 2));
#else // ! USE_KNUTH
    const uint32_t sixes = 0x66666666;
    const uint32_t eights = 0x88888888;
    uint32_t xe, s, u, carry_out;
    xe = x + sixes;         // add "nibble offset"; cannot overflow
    s = xe ^ y;             // preliminary sum bits
    u = xe + y;
    carry_out = u < xe;     // carry-out from most significant nibble
    s = (carry_out << 31) | ((s ^ u) >> 1); // sum bits with carry-in
    s = s & eights;         // isolate nibble lsbs with carry-in
    *cy = carry_out;        // record carry-out from most significant nibble
    u = (x + y) + (s - (s >> 2)); // add "nibble offset" to nibbles that need it
    return u;
#endif // USE_KNUTH
}

/* 8-digit BCD addition with carry-in */
uint32_t BCD32_ADDC (uint32_t x, uint32_t y, uint32_t cy)
{
#if USE_KNUTH
    // derived from Knuth, TAOCP 4a (2011), answer to exercise 100 section 7.1.3
    uint32_t z, u, t;
    x = x + cy;
    z = y + 0x66666666;
    u = x + z;
    t = median32 (x, z, ~u) & 0x88888888;
    return (x + y) + (t - (t >> 2));
#else // !USE_KNUTH
    const uint32_t sixes = 0x66666666;
    const uint32_t eights = 0x88888888;
    uint32_t xe, s, u, carry_out;
    y = y + cy;             // incorporate carry-in; cannot overflow
    xe = x + sixes;         // add "nibble offset"; cannot overflow
    s = xe ^ y;             // preliminary sum bits
    u = xe + y;              
    carry_out = u < xe;     // carry-out from most significant nibble
    s = (carry_out << 31) | ((s ^ u) >> 1); // sum bits with carry-in
    s = s & eights;         // isolate nibble lsbs with carry-in
    u = (x + y) + (s - (s >> 2)); // add "nibble offset" to nibbles that need it
    return u;
#endif // USE_KNUTH
}

/* 8-digit BCD addition with carry-in and carry-out */
uint32_t BCD32_ADDCcc (uint32_t x, uint32_t y, uint32_t *cy)
{
#if USE_KNUTH
    // derived from Knuth, TAOCP 4a (2011), answer to exercise 100 section 7.1.3
    uint32_t z, u, t;
    x = x + (*cy);
    z = y + 0x66666666;
    u = x + z;
    t = median32 (x, z, ~u) & 0x88888888;
    *cy = t >> 31;
    return (x + y) + (t - (t >> 2));
#else // !USE_KNUTH
    const uint32_t sixes = 0x66666666;
    const uint32_t eights = 0x88888888;
    uint32_t xe, s, u, carry_out;
    y = y + (*cy);          // incorporate carry-in; cannot overflow
    xe = x + sixes;         // add "nibble offset"; cannot overflow
    s = xe ^ y;             // preliminary sum bits
    u = xe + y;              
    carry_out = u < xe;     // carry-out from most significant nibble
    s = (carry_out << 31) | ((s ^ u) >> 1); // sum bits with carry-in
    s = s & eights;         // isolate nibble lsbs with carry-in
    *cy = carry_out;        // record carry-out from most significant nibble
    u = (x + y) + (s - (s >> 2)); // add "nibble offset" to nibbles that need it
    return u;
#endif // USE_KNUTH
}
#endif // PORTABLE && !NATIVE_64BIT

#if PORTABLE
#if NATIVE_64BIT
/* majority-of-3 operation that Knuth refers to more generically as median */
uint64_t median64 (uint64_t x, uint64_t y, uint64_t z)
{
    return (x | (y & z)) & (y | z);
}

/* 16-digit BCD addition */
uint64_t BCD64_ADD (uint64_t x, uint64_t y)
{
    // derived from Knuth, TAOCP 4a (2011), answer to exercise 100 section 7.1.3
    uint64_t z, u, t;
    z = y + 0x6666666666666666ULL;
    u = x + z;
    t = median64 (x, z, ~u) & 0x8888888888888888ULL;
    return (x + y) + (t - (t >> 2));
}
#else // !NATIVE_64BIT
/* 16-digit BCD addition */
uint64_t BCD64_ADD (uint64_t x, uint64_t y)
{
    uint32_t x_lo = (uint32_t)x;
    uint32_t x_hi = (uint32_t)(x >> 32);
    uint32_t y_lo = (uint32_t)y;
    uint32_t y_hi = (uint32_t)(y >> 32);
    uint32_t cy;
    uint32_t r_lo = BCD32_ADDcc (x_lo, y_lo, &cy);
    uint32_t r_hi = BCD32_ADDC  (x_hi, y_hi,  cy);
    return ((uint64_t)r_hi << 32) | r_lo;
}
#endif // NATIVE_64BIT
#else // !PORTABLE
/*
  16-digit BCD addition; N. Juffa, about 1999. Code retrieved on 3/24/2024 from 
  https://web.archive.org/web/20001211131100/http://www.azillionmonkeys.com/qed/asmexample.html
  Bug fixed and adjusted for Intel compiler's inline asm (may need -fasm-blocks)
*/
uint64_t BCD64_ADD (uint64_t x, uint64_t y)
{
    uint64_t r;

    __asm mov   eax, dword ptr [x]   ; // x (lo)
    __asm mov   ebx, dword ptr [y]   ; // y (lo)
    __asm mov   edx, dword ptr [x+4] ; // x (hi)
    __asm mov   ecx, dword ptr [y+4] ; // y (hi)
    // x in EDX:EAX, y in EDX:EBX. Add least significant 8 BCD digits
    __asm mov   esi, eax             ; // x
    __asm lea   edi, [eax+66666666h] ; // x + 0x66666666
    __asm xor   esi, ebx             ; // x ^ y
    __asm add   eax, ebx             ; // x + y
    __asm shr   esi, 1               ; // t1 = (x ^ y) >> 1
    __asm add   edi, ebx             ; // x + y + 0x66666666
    __asm sbb   ebx, ebx             ; // capture carry: carry ? (-1) : 0
    __asm rcr   edi, 1               ; // t2 = (x + y + 0x66666666) >> 1
    __asm xor   esi, edi             ; // t1 ^ t2
    __asm and   esi, 88888888h       ; // t3 = (t1 ^ t2) & 0x88888888
    __asm add   eax, esi             ; // x + y + t3
    __asm shr   esi, 2               ; // t3 >> 2
    __asm sub   eax, esi             ; // x + y + t3 - (t3 >> 2)
    // Add most significant 8 BCD digits
    __asm sub   ecx, ebx             ; // y - carry // propagate carry
    __asm mov   esi, edx             ; // x
    __asm lea   edi, [edx+66666666h] ; // x + 0x66666666
    __asm xor   esi, ecx             ; // x ^ y
    __asm add   edx, ecx             ; // x + y
    __asm shr   esi, 1               ; // t1 = (x ^ y) >> 1
    __asm add   edi, ecx             ; // x + y + 0x66666666
    // __asm sbb   ebx, ebx             ; capture carry: carry ? (-1) : 0
    __asm rcr   edi, 1               ; // t2 = (x + y + 0x66666666) >> 1
    __asm xor   esi, edi             ; // t1 ^ t2
    __asm and   esi, 88888888h       ; // t3 = (t1 ^ t2) & 0x88888888
    __asm add   edx, esi             ; // x + y + t3
    __asm shr   esi, 2               ; // t3 >> 2
    __asm sub   edx, esi             ; // x + y + t3 - (t3 >> 2)
    // Now EDX:EAX = x + y
    __asm mov   dword ptr [r], eax   ; // save result (lo)
    __asm mov   dword ptr [r+4], edx ; // save result (hi)

    return r;
}
#endif // PORTABLE

/* convert binary operand x in [0, 9999] to BCD */
uint32_t cvt_bin2bcd_4dig (uint32_t x)
{
    uint32_t num, res = 0;
    /* convert to 21:11 fixed point */
    num = ((0x8312 * x) + 0x6000) >> 14; // (2**11 / 10**3) * 2**14
    /* pull out decimal digits one at a time */
    res =              (num >> 11);
    num = (num & ((1 << 11) - 1)) * 5;   // 22:10 fixed point
    res = (res << 4) | (num >> 10);
    num = (num & ((1 << 10) - 1)) * 5;   // 23:9 fixed point
    res = (res << 4) | (num >>  9);
    num = (num & ((1 <<  9) - 1)) * 5;   // 24:8 fixed point
    res = (res << 4) | (num >>  8);
    return res;
}

/* convert binary operand x in [0, 99999999] to BCD */
uint32_t cvt_bin2bcd_8dig (uint32_t x)
{
    uint32_t hi = x / 10000;
    uint32_t lo = x - hi * 10000;
    return (cvt_bin2bcd_4dig (hi) << 16) | cvt_bin2bcd_4dig (lo);
}

/* convert binary operand x in [0, 9999999999999999] to BCD */
uint64_t uint64_to_bcd64 (uint64_t x)
{
    uint32_t hi = (uint32_t)(x / 100000000ULL);
    uint32_t lo = (uint32_t)(x - hi * 100000000ULL);
    return (((uint64_t)cvt_bin2bcd_8dig (hi)) << 32) | cvt_bin2bcd_8dig (lo);   
}

/* return most significant 32 bits of the full product of two 32-bit operands */
uint32_t umul32hi (uint32_t a, uint32_t b)
{
    uint64_t prod = (uint64_t)a * b;
    return (uint32_t)(prod >> 32);
}

/* convert BCD operand x in [0, 99999999] to binary
   based on Knuth TAOCP 2, answer on p. 638 to exercise 19 in section 4.4
*/
uint32_t cvt_bcd2bin_8dig (uint32_t x)
{
    const uint32_t m1 = 0xf0f0f0f0;
    const uint32_t m2 = 0xff00ff00;
    const uint32_t m3 = 0xffff0000;
    const uint32_t c1 = 0x60000000; // 0.375000000000
    const uint32_t c2 = 0x9c000000; // 0.609375000000
    const uint32_t c3 = 0xd8f00000; // 0.847412109375
    x = x - umul32hi (c1, x & m1);
    x = x - umul32hi (c2, x & m2);
    x = x - umul32hi (c3, x & m3);
    return x;
}

/* convert BCD operand a in [0, 9999999999999999] to binary */
uint64_t bcd64_to_uint64 (uint64_t a)
{
    uint32_t hi = (uint32_t)(a >> 32);
    uint32_t lo = (uint32_t)a;
    return 100000000ULL * cvt_bcd2bin_8dig (hi) + cvt_bcd2bin_8dig (lo);
}

/*
  https://groups.google.com/forum/#!original/comp.lang.c/qFv18ql_WlU/IK8KGZZFJx4J
  From: geo <[email protected]>
  Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
  Subject: 64-bit KISS RNGs
  Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)

  This 64-bit KISS RNG has three components, each nearly
  good enough to serve alone.    The components are:
  Multiply-With-Carry (MWC), period (2^121+2^63-1)
  Xorshift (XSH), period 2^64-1
  Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64  (kiss64_t = (kiss64_x << 58) + kiss64_c, \
                kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
                kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64  (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
                kiss64_y ^= (kiss64_y << 43))
#define CNG64  (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)

#if TEST_PORTABLE_BCD64
int main (void)
{
    uint64_t bx, by, x, y, ref, res, i;

    printf ("Testing 16-digit BCD arithmetic\n");
    for (i = 0; i < NUM_TESTS; i++) {
        x = KISS64 % 10000000000000000ULL;
        y = KISS64 % 10000000000000000ULL;
        bx = uint64_to_bcd64 (x);
        by = uint64_to_bcd64 (y);
        res = BCD64_ADD (bx, by);
        ref = uint64_to_bcd64 ((x + y) % 10000000000000000ULL);
        if (res != ref) {
            printf ("!!!! x=%016llx  y=%016llx  bx=%016llx  by=%016llx  res=%016llx  ref=%016llx\n", 
                    x, y, bx, by, res, ref);
            return EXIT_FAILURE;
        }
    }
    printf ("test PASSED\n");
    return EXIT_SUCCESS;
}

#else // TEST_PORTABLE_BCD64

typedef struct ten_byte {
    uint64_t low64;
    uint16_t high16;
} ten_byte;

/* 18-digit BCD addition modulo 10**18 using the x87 FPU */
ten_byte bcd18dig_add_ref (ten_byte a, ten_byte b)
{
    const double exp10_18 = 1e18;
    const uint16_t cw_extended_chop = 0xf7f;
    ten_byte aa, bb, r;
    uint16_t cw_original;

    aa = a; // workaround: bad code generated when data not local to function
    bb = b;
    __asm fstcw  [cw_original]        ;// save current CW
    __asm fldcw  [cw_extended_chop]   ;// RC=truncate PC=extended
    __asm fld    qword ptr [exp10_18] ;// 10**18
    __asm fbld   tbyte ptr [aa]       ;// a, 10**18
    __asm fbld   tbyte ptr [bb]       ;// b, a, 10**18
    __asm fadd   st, st(1)            ;// a+b, a, 10**18
    __asm fst    st(1)                ;// a+b, a+b, 10**18
    __asm fdiv   st, st(2)            ;// (a+b)/10**18, a+b, 10**18
    __asm frndint                     ;// trunc((a+b)/10**18), a+b, 10**18
    __asm fmulp  st(2), st            ;// a+b, trunc((a+b)/10**18)*10**18
    __asm fsubrp st(1),st             ;// a+b - trunc((a+b)/10**18)*10**18;
    __asm fbstp  tbyte ptr [r]        ;// <empty>
    __asm fldcw  [cw_original]        ;// restore original CW

    return r;
}

/* 18-digit BCD addition modulo 10**18, processed in 8-digit chunks */
ten_byte bcd18dig_add (ten_byte x, ten_byte y)
{
    ten_byte r;

    uint32_t x_lo = (uint32_t)(x.low64);
    uint32_t x_md = (uint32_t)(x.low64 >> 32);
    uint32_t x_hi = (uint32_t)(x.high16);
    uint32_t y_lo = (uint32_t)(y.low64);
    uint32_t y_md = (uint32_t)(y.low64 >> 32);
    uint32_t y_hi = (uint32_t)(y.high16);
    uint32_t cy;
    uint32_t r_lo = BCD32_ADDcc (x_lo, y_lo, &cy);
    uint32_t r_md = BCD32_ADDCcc(x_md, y_md, &cy);
    uint32_t r_hi = BCD32_ADDC  (x_hi, y_hi,  cy);
    r_hi &= 0xff;
    r.low64 = ((uint64_t)r_md << 32) | r_lo;
    r.high16 = (uint16_t)r_hi;

    return r;
}

int main (void)
{
    uint64_t x, y;
    ten_byte bcd_augend, bcd_addend, bcd_sum_ref, bcd_sum_res;
    
    printf ("Testing 18-digit BCD arithmetic against x87 FPU\n");
    for (int i = 0; i < NUM_TESTS; i++) {
        x = KISS64 % 10000000000000000ULL;
        y = KISS64 % 10000000000000000ULL;
        bcd_augend.low64 = uint64_to_bcd64 (x);
        bcd_addend.low64 = uint64_to_bcd64 (y);
        x = KISS64 % 100ULL;
        y = KISS64 % 100ULL;
        bcd_augend.high16 = uint64_to_bcd64 (x);
        bcd_addend.high16 = uint64_to_bcd64 (y);
        
        bcd_sum_ref = bcd18dig_add_ref (bcd_augend, bcd_addend);
        bcd_sum_res = bcd18dig_add     (bcd_augend, bcd_addend);

        if (memcmp (&bcd_sum_res, &bcd_sum_ref, 9) != 0) {
            printf ("a   = %02x%016llx\n", bcd_augend.high16, bcd_augend.low64);
            printf ("b   = %02x%016llx\n", bcd_addend.high16, bcd_addend.low64);
            printf ("res = %02x%016llx\n",bcd_sum_res.high16,bcd_sum_res.low64);
            printf ("ref = %02x%016llx\n",bcd_sum_ref.high16,bcd_sum_ref.low64);
            return EXIT_FAILURE;
        }
    }
    printf ("test PASSED\n");
    return EXIT_SUCCESS;
}
#endif // TEST_PORTABLE_BCD64

I am only interested in answers related to performance.


Solution

  • IMO, there's little improvement to be had in the architecture-agnostic formulas.
    A full-adder yields the "MSB" method: (a | b) ^ ((a ^ b) & (a + b)).
    A half-adder yields the "LSB" method: ((a & b) + ((a ^ b) >> 1)) ^ ((a ^ b) >> 1).

    A "MUX" method might also be competitive (especially if andnot is a single operation). It exploits the limited range of BCD inputs to perform two "cheap" overflow checks instead of one universal overflow check.


    Consider a full-adder; the formula for each bit is:

    // where c == carry_in to each bit;
    sum = a ^ b ^ c;
    carry_out = (b & c) | (a & c) | (a & b);
    

    carry_out (aka. bitwise majority of three) can be written other ways:

    maj3 = (a | b) & ((a & b) | c);
    maj3 = (a | b) &~((a ^ b) &~c);
    maj3 = (a | b) ^ ((a ^ b) &~c); 
    maj3 = (a & b) ^ ((a ^ b) & c);
    maj3 = (a & b) | ((a ^ b) & c);
    maj3 = (a & b) | ((a | b) & c);
    maj3 = ((b ^ a) & (c ^ a)) ^ a;
    // etc.
    

    Solve for carry_in:

    sum == a + b;
    a + b == a ^ b ^ c;
    (a + b) ^ a ^ b == c;
    

    Substitute and simplify:

    carry_out = (a | b) ^ ((a ^ b) &~c);
    carry_out = (a | b) ^ ((a ^ b) &~((a + b) ^ a ^ b)); 
    carry_out = (a | b) ^ ((a ^ b) & (a + b));
    

    With arbitrary precision, carry_out == carry_in >> 1.
    With modular arithmetic, the carry_out from the most-significant bit position is lost.
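
    As a quick sanity check (my addition, not part of the original answer), the derived carry_out formula can be verified exhaustively against a bit-serial ripple-carry reference, e.g. over all 8-bit operand pairs:

    #include <stdio.h>
    #include <stdint.h>

    int main (void)
    {
        for (uint32_t a = 0; a < 256; a++) {
            for (uint32_t b = 0; b < 256; b++) {
                uint32_t s = (a + b) & 0xff;                  // modular 8-bit sum
                uint32_t formula = (a | b) ^ ((a ^ b) & s);   // derived carry_out vector
                uint32_t ref = 0, cin = 0;                    // ripple-carry reference
                for (int i = 0; i < 8; i++) {
                    uint32_t x = (a >> i) & 1, y = (b >> i) & 1;
                    uint32_t cout = (x & y) | (x & cin) | (y & cin);
                    ref |= cout << i;
                    cin = cout;
                }
                if (formula != ref) { printf ("mismatch a=%02x b=%02x\n", a, b); return 1; }
            }
        }
        printf ("carry_out formula verified for all 8-bit pairs\n");
        return 0;
    }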

    There is a half-adder based trick to calculate (a + b) / 2 without unsigned overflow.

    a + b can also be written as:

    a + b == (a & b) + (a | b);
    a + b == (a & b) + (a & b) + (a ^ b);
    // a + b == (a | b) + (a | b) - (a ^ b);
    

    so:

    a ^ b ^ c == a + b;
    a ^ b ^ c == ((a & b) << 1) + (a ^ b);
    (a ^ b ^ c) >> 1 == (a & b) + ((a ^ b) >> 1);
    c >> 1 == ((a & b) + ((a ^ b) >> 1)) ^ ((a ^ b) >> 1);
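
    The last identity can likewise be checked exhaustively (again my addition), taking c = (a + b) ^ a ^ b as the carry-in vector:

    #include <stdio.h>
    #include <stdint.h>

    int main (void)
    {
        for (uint32_t a = 0; a < 256; a++) {
            for (uint32_t b = 0; b < 256; b++) {
                uint32_t c   = (a + b) ^ a ^ b;   // carry-in vector; bit 8 holds the final carry-out
                uint32_t lhs = c >> 1;
                uint32_t rhs = ((a & b) + ((a ^ b) >> 1)) ^ ((a ^ b) >> 1); // overflow-free form
                if (lhs != rhs) { printf ("mismatch a=%02x b=%02x\n", a, b); return 1; }
            }
        }
        printf ("half-adder identity verified for all 8-bit pairs\n");
        return 0;
    }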
    

    These identities provide the following formulas for packed BCD addition:

    // a version based on a full-adder
    uint32_t bcd_add_msb(uint32_t x, uint32_t y, uint32_t *c) {
        uint32_t a, b, carries, fixup;
        a = x + 0x66666666;
        b = y + *c; // carry_in is optional until the sum
        carries = (a | y) ^ ((a ^ y) & (a + b));
        *c = carries >> 31;
        fixup = carries & 0x88888888;
        return (x + b) + (fixup - (fixup >> 2));
    }
    
    // a version based on a half-adder
    uint32_t bcd_add_lsb(uint32_t x, uint32_t y, uint32_t *c) {
        uint32_t clsum, avg, fixup;
        y += *c;
        clsum = (x ^ y) >> 1;
        avg = (x & y) + clsum + 0x33333333;
        *c = avg >> 31;
        fixup = (clsum ^ avg) & 0x88888888;
        return x + y + (fixup - (fixup >> 2));
    }
    
    // a version with two (range limited) overflow checks
    // bitwise_select => (x + y) ? (~((x + y) + 0x66666666)) : (x | y);
    uint32_t bcd_add_mux(uint32_t x, uint32_t y, uint32_t *c) {
        uint32_t sum, mux, fixup;
        sum = x + y + *c;
        mux = (sum &~ (sum + 0x66666666)) | ((x | y) &~ sum);
        *c = mux >> 31;
        fixup = mux & 0x88888888;
        return sum + (fixup - (fixup >> 2));
    }
    
    // Just for fun:
    // convert binary operand x in [0, 9999] to packed BCD
    uint64_t cvt_bin2bcd_4dig (uint64_t x)
    {
        const uint64_t mask = 0x0FFFFE1FFFF87FFF;
        uint64_t y;
        x *= 0x000418A051EC0CCD;
        y = (x & mask) * 10;
        x &= ~mask;
        y &= ~mask;
        return (uint16_t)((y >> 15) + (y >> 33) + (y >> 52) + (x >> 48));
    }
    
    /***  explanation for cvt_bin2bcd_4dig() 
    the usual divide by constant, fixed-point stuff:
    max 9999, div 10, (x * 0xCCD) >> 15 // integer part: 10
    max 9999, div 100, (x * 0x147B) >> 19 // integer part: 7
    max 9999, div 1000, (x * 0x20C5) >> 23 // integer part: 4
    
    However, since we're discarding the integer part...
    we can overlap the integer part of a 'lane' with the fractional part of the next lane.
    We have plenty of fractional precision anyways... (e.g. 26 bits required
    to calc but only 14 bits needed to round-trip)
    
    so our magic is: (0x20C5ULL << 37) + (0x147BULL << 18) + 0xCCD;
    the mask just cuts out the low 4-bits of the integer part for the result:
        mask = ~((0xFULL << 60) + (0xFULL << 37) + (0xF << 15));
    
    ***/