Search code examples
coptimizationlong-integer

Word width before and after multiplication


I can't tell if anyone had asked this question before. The root behind this question is the mistake (in my opinion) in standard C or C++ in that the implicit word width of an integer product after multiplication is the same as the larger of word widths of the integer multiplier and multiplicand.

So, this wouldn't be a problem if, in C, multiplying an int32_t times an int32_t would implicitly result in an int64_t. But it doesn't. Nearly every integer CPU and fixed-point DSP that I am aware of multiplies two N-bit operands into a register (or register pair) that is 2N bits wide and I wish that C would do that (without some kinda inline assembly).

Now I know that this works (it's a simple fixed-point Direct Form 1 biquad filter with fraction saving around the sole point of quantization):

int64_t accumulator;

int64_t a1, a2, b0, b1, b2;   /* 2nd-order IIR coefficients */

int32_t state_x1, state_x2, state_y1, state_y2, state_fraction;

int32_t x, y;

int32_t get_input(void);

void put_output(int32_t y);

int i, num_samples;
         /* num_samples is the number of samples per block */

/*  load up your coefficients, cast all to int64_t */

/*  load up your states from wherever */

accumulator = (int64_t)state_fraction;

for (i=0; i<num_samples; i++)
    {
    x = get_input();

    accumulator += b0*x;
    accumulator += b1*state_x1;
    accumulator += b2*state_x2;
    accumulator += a1*state_y1;
    accumulator += a2*state_y2;

    if (accumulator > 0x1FFFFFFFFFFFFFFF)
       {
       accumulator = 0x1FFFFFFFFFFFFFFF;     /* clip value */
       }
    if (accumulator < -0x2000000000000000)
       {
       accumulator = -0x2000000000000000;    /* clip value */
       }

    y =  (int32_t)(accumulator>>30);   /* always rounding down */

    state_x2 = state_x1;             /* bump the states over */
    state_x1 = x;
    state_y2 = state_y1;
    state_y1 = y;

    accumulator = accumulator & 0x000000003FFFFFFF;
    /* keep the fractional bits that you dropped for the next sample
       otherwise clear the accumulator */

    put_output(y);
    }

state_fraction = (int32_t)accumulator;

/*  save your states back to wherever  */

But I don't want to waste the MIPS to multiply an int64_t times an int32_t when I know that both values are 32-bit numbers. But I also know that the result must be 64-bits wide else there is trouble.

Suppose I did this:

int64_t accumulator;

int32_t a1, a2, b0, b1, b2;   /* 2nd-order IIR coefficients */

int32_t state_x1, state_x2, state_y1, state_y2, state_fraction;

int32_t x, y;

int32_t get_input(void);

void put_output(int32_t y);

int i, num_samples;
         /* num_samples is the number of samples per block */

/*  load up your coefficients, (leaving as int32_t) */

/*  load up your states from wherever */

accumulator = (int64_t)state_fraction;

for (i=0; i<num_samples; i++)
    {
    x = get_input();

    accumulator += (int64_t)b0*x;
    accumulator += (int64_t)b1*state_x1;
    accumulator += (int64_t)b2*state_x2;
    accumulator += (int64_t)a1*state_y1;
    accumulator += (int64_t)a2*state_y2;

    if (accumulator > 0x1FFFFFFFFFFFFFFF)
       {
       accumulator = 0x1FFFFFFFFFFFFFFF;     /* clip value */
       }
    if (accumulator < -0x2000000000000000)
       {
       accumulator = -0x2000000000000000;    /* clip value */
       }

    y =  (int32_t)(accumulator>>30);   /* always rounding down */

    state_x2 = state_x1;             /* bump the states over */
    state_x1 = x;
    state_y2 = state_y1;
    state_y1 = y;

    accumulator = accumulator & 0x000000003FFFFFFF;
    /* keep the fractional bits that you dropped for the next sample
       otherwise clear the accumulator */

    put_output(y);
    }

state_fraction = (int32_t)accumulator;

/*  save your states back to wherever  */

Would the compiler be smart enough to understand what I want? To multiply two 32-bit signed integers together and add the 64-bit result into a 64-bit accumulator without casting anything into a 64-bit register? (i.e. no sign-extension operation and no 64 x 64 bit multiply?)

I wish I could write this is C code without worrying that the ARM compiler will create non-optimal code. Am I forced to do this in assembly code to make sure it's done right?


Solution

  • So, this wouldn't be a problem if, in C, multiplying an int32_t times an int32_t would implicitly result in an int64_t. But it doesn't.

    No, and it very easy to check if you call this function:

    void foo(int32_t a, int32_t b)
    {
        printf("%zu\n", sizeof(a * b));
    }
    

    Would the compiler be smart enough to understand what I want?

    No, the compiler does not have to predict what you want, it has to follow the instructions written in the C Standard. You need to learn and understand the language well enough to know how datatypes will behave in your code.

    I wish I could write this C code without worrying that the ARM compiler will create non-optimal code. Am I forced to do this in assembly code to make sure it's done right?

    Trust your compiler. It will generate a very efficient machine code for your program. You need to tell the compiler what you want.

    In programming, you should not be tempted to write as short source file as possible, you need to be precise. Also, code should be easy to understand for humans.

    Also, do not worry about premature optimizations. In 99.99% not efficient code is related to bad algorithm (which is the programmer fault), not poor compiler work.

    Example:

    int64_t foo(int32_t b0, 
            int32_t b1,
            int32_t b2,
            int32_t a1,
            int32_t a2,
            int32_t state_x1,
            int32_t state_x2,
            int32_t state_y1,
            int32_t state_y2,
            int32_t x,
            int64_t accumulator)
    {
        accumulator += (int64_t)b0*x;
        accumulator += (int64_t)b1*state_x1;
        accumulator += (int64_t)b2*state_x2;
        accumulator += (int64_t)a1*state_y1;
        accumulator += (int64_t)a2*state_y2;
        return accumulator;
    }
    

    ARM (Cortex-m 32bit thumb)

    foo:
            push    {r4, lr}
            mov     ip, r1
            mov     lr, r0
            ldrd    r0, r1, [sp, #32]
            ldr     r4, [sp, #28]
            smlal   r0, r1, lr, r4
            ldr     r4, [sp, #12]
            smlal   r0, r1, ip, r4
            ldr     r4, [sp, #16]
            smlal   r0, r1, r2, r4
            ldr     r2, [sp, #20]
            smlal   r0, r1, r3, r2
            ldr     r3, [sp, #24]
            ldr     r2, [sp, #8]
            smlal   r0, r1, r2, r3
            pop     {r4, pc}