
Efficient Embedded Fixed Point 2x2 Matrix Multiplication in ARM Cortex-M4 C code


I am trying to implement a VERY efficient 2x2 matrix multiplication in C code for an ARM Cortex-M4. The function accepts 3 pointers to 2x2 arrays: 2 for the inputs to be multiplied and an output buffer passed in by the caller. Here is what I have so far...

static inline void multiply_2x2_2x2(int16_t a[2][2], int16_t b[2][2], int32_t c[2][2])
{
  int32_t a00a01, a10a11, b00b10, b01b11;

  a00a01 = a[0][0] | a[0][1]<<16;
  b00b10 = b[0][0] | b[1][0]<<16;
  b01b11 = b[0][1] | b[1][1]<<16;
  c[0][0] = __SMUAD(a00a01, b00b10);
  c[0][1] = __SMUAD(a00a01, b01b11);

  a10a11 = a[1][0] | a[1][1]<<16;
  c[1][0] = __SMUAD(a10a11, b00b10);
  c[1][1] = __SMUAD(a10a11, b01b11);
}

Basically, my strategy is to use the ARM Cortex-M4 __SMUAD() intrinsic to do the actual multiply-accumulates. But this requires me to build the inputs a00a01, a10a11, b00b10, and b01b11 ahead of time. My question is, given that a C array should be contiguous in memory, is there a more efficient way to pass the data into the function directly, without the intermediate variables? Secondary question: am I overthinking this, and should I just let the compiler do its job, as it is smarter than I am? I tend to do that a lot.

Thanks!


Solution

  • The first main concern is that some_signed_int << 16 invokes undefined behavior for negative numbers. So you have bugs all over. And then bitwise OR of two int16_t where either is negative does not necessarily form a valid int32_t either. Do you actually need the sign or can you drop it?

    ARM examples use unsigned int, which in turn supposedly contains 2x int16_t in raw binary form. This is what you actually want too.

    Also, it would seem that it shouldn't matter to SMUAD which 16-bit halfword you place where, since it sums both products either way. So the a[0][0] | a[0][1]<<16 just serves to needlessly swap data around in memory. It will confuse the compiler, which can't optimize such code well. Sure, shifts etc. are always very fast, but this is pointless overhead.

    (As someone noted, this whole thing is likely much easier to write in pure assembler without concern of all the C type rules and undefined behavior.)
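    As a sketch of the UB-free way to build such a halfword pair (the helper name pack16x2 is mine, not from any ARM header): convert each int16_t to uint16_t first, then widen and shift as unsigned, so no negative value is ever left-shifted:

```c
#include <assert.h>
#include <stdint.h>

/* Pack two int16_t halfwords into one uint32_t without signed-shift UB:
   the int16_t -> uint16_t conversion is well-defined (two's complement wrap),
   and all shifting/OR-ing then happens on unsigned types. */
static uint32_t pack16x2(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}
```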

    To avoid all these issues you could define your own union type:

    typedef union
    {
      int16_t  i16 [2][2];
      uint32_t u32 [2];
    } mat2x2_t;
    
    • u32[0] corresponds to i16[0][0] and i16[0][1]
    • u32[1] corresponds to i16[1][0] and i16[1][1]

    C actually lets you "type pun" between these types pretty wildly (unlike C++). Unions also dodge the brittle strict aliasing rules.
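    A small sketch of that punning in action (repeating the union so the snippet is self-contained; this assumes a little-endian target, which Cortex-M almost always is, so i16[r][0] lands in the low halfword of u32[r]):

```c
#include <assert.h>
#include <stdint.h>

typedef union
{
  int16_t  i16 [2][2];
  uint32_t u32 [2];
} mat2x2_t;

/* Write through the i16 view, read the same bytes back through the u32
   view. On a little-endian machine each u32 holds one matrix row, with
   column 0 in the low halfword. */
static uint32_t row_as_word(const mat2x2_t *m, int r)
{
    return m->u32[r];
}
```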

    The function can then become something along the lines of this pseudo code. Note that SMUAD computes a dot product of two halfword pairs, so for each call to produce one element of the result, the second matrix has to be laid out by columns (i.e. transposed):

    static void mat_mul16 (mat2x2_t a, mat2x2_t bt, int32_t c[2][2])
    {
       /* bt holds b transposed, so each bt.u32[j] is one column of b */
       c[0][0] = __SMUAD(a.u32[0], bt.u32[0]); /* row 0 of a, column 0 of b */
       ...
    }
    
    

    Supposedly each such line should give 2x signed 16 multiplications as per the SMUAD instruction.
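    Since __SMUAD only exists on the target, here is a portable model of its documented behavior (sum of the two signed 16x16 products) that can be used to sanity-check the scheme on a host; the names smuad_model and pack2 are mine, not CMSIS:

```c
#include <assert.h>
#include <stdint.h>

/* Portable model of the Cortex-M4 SMUAD instruction: multiply the signed
   low halfwords, multiply the signed high halfwords, add the products. */
static int32_t smuad_model(uint32_t x, uint32_t y)
{
    int32_t lo = (int16_t)(x & 0xFFFFu) * (int16_t)(y & 0xFFFFu);
    int32_t hi = (int16_t)(x >> 16)     * (int16_t)(y >> 16);
    return lo + hi;
}

/* Pack two int16_t into one word, low halfword first, via unsigned types. */
static uint32_t pack2(int16_t lo16, int16_t hi16)
{
    return (uint32_t)(uint16_t)lo16 | ((uint32_t)(uint16_t)hi16 << 16);
}
```

    For example, one call really is a whole row-times-column step: the row {1, 2} against the column {5, 7} gives 1*5 + 2*7 = 19 in a single model invocation.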

    As for whether this actually gives some revolutionary performance increase compared to a default MUL, I kind of doubt it. Disassemble and count CPU ticks.

    am I overthinking this and I should just let the compiler do its job as it is smarter than I am?

    Most likely :) The old rule of thumb: benchmark and then only manually optimize at the point when you've actually found a performance bottleneck.