To keep n previous values, is memmove faster than a modulo operation?

I'm writing a digital filter, and I need to keep the last X values and sum them all together.

Now there are two possible approaches to this. Either I shift the whole array using memmove to make room for the next value, and have the right indexes to the array as hard-coded values in my summing algorithm.

memmove(&Fifo[0], &Fifo[1], 12 * 4); // Shift array to the left

Result += Factor[1] * (Fifo[5] + Fifo[7]);
Result += Factor[2] * (Fifo[4] + Fifo[8]);
Result += Factor[3] * (Fifo[3] + Fifo[9]);
Result += Factor[4] * (Fifo[2] + Fifo[10]);
Result += Factor[5] * (Fifo[1] + Fifo[11]);
Result += Factor[6] * (Fifo[0] + Fifo[12]);

Or alternatively, I don't copy any memory, but increment a counter instead, and calculate each index from that using a modulo operation (like a circular buffer).

i++; // Increment the index

Result += Factor[1] * (Fifo[(i + 5) % 13] + Fifo[(i + 7) % 13]);
Result += Factor[2] * (Fifo[(i + 4) % 13] + Fifo[(i + 8) % 13]);
Result += Factor[3] * (Fifo[(i + 3) % 13] + Fifo[(i + 9) % 13]);
Result += Factor[4] * (Fifo[(i + 2) % 13] + Fifo[(i + 10) % 13]);
Result += Factor[5] * (Fifo[(i + 1) % 13] + Fifo[(i + 11) % 13]);
Result += Factor[6] * (Fifo[(i + 0) % 13] + Fifo[(i + 12) % 13]);

Since its an embedded ARM cpu, I was wondering what would be more efficient. Since I assume that the CPU has to move at least one 32-bit value internally to do the modulo operation, could it be that just moving the whole array is just as fast as calculating the right indexes?

Solution

I've done exactly that on a Cortex M0 (LPC11C14) for a FIR filter of size 15 (Savitzky-Golay for measuring line voltage).

I found that in my case copying was somewhat slower than using a circular buffer of size 16 and computing the indices using the modulo operator. Note that 16 is a power of two, which makes division very cheap.

I tried several variants and used a port pin for measuring execution time, I recommend you do the same.