I'm trying to use the IIR filter from application note AVR223, which is written in the IAR dialect of AVR assembly. The main reason being that in C you can only do full 32*32=32 bits multiplications. So I'm especially interested in the following macros:
MUL_MOV_24 MACRO
// Multiply (signed) high bytes of both coefficient and sample work registers.
// Then copy the resulting low byte into the accumulator's high byte
// (sign bit will be correct).
muls COEFH, DATAH
mov AC2, R0
// Multiply (unsigned) low bytes of the coefficient and sample work registers.
// Copy the resulting high & low byte to the accumulator's low & middle bytes.
mul COEFL, DATAL
mov AC0, R0
mov AC1, R1
// Multiply (signed-unsigned) high coefficient byte and low sample byte.
// Add resulting low byte to accumulator's middle byte. (May generate carry!)
// Add, with carry, resulting high byte to accumulator's high byte.
mulsu COEFH, DATAL
add AC1, R0
adc AC2, R1
// Multiply (signed-unsigned) high sample byte and low coefficient byte.
// Add resulting low byte to accumulator's middle byte. (May generate carry!)
// Add, with carry, resulting high byte to accumulator's high byte.
mulsu DATAH, COEFL
add AC1, R0
adc AC2, R1
ENDM
// SMAC_24 does the same thing as MUL_MOV_24 except it adds to the accumulator
// from the start, instead of overwriting.
SMAC_24 MACRO
muls COEFH, DATAH
add AC2, R0
mul COEFL, DATAL
add AC0, R0
adc AC1, R1 // This may generate a carry..
adc AC2, ZERO // ..which must be added to accumulator's high byte!
mulsu COEFH, DATAL
add AC1, R0
adc AC2, R1
mulsu DATAH, COEFL
add AC1, R0
adc AC2, R1
ENDM
Which I tried to convert to the following functions:
int32_t mul_mov_24(int16_t coef, int16_t data) {
int32_t ac = 0;
asm (
"muls %B[COEF], %B[DATA] \n\t"
"mov %C[AC], r0 \n\t"
"mul %A[COEF], %A[DATA] \n\t"
"mov %A[AC], r0 \n\t"
"mov %B[AC], r1 \n\t"
"mulsu %B[COEF], %A[DATA] \n\t"
"add %B[AC], r0 \n\t"
"adc %C[AC], r1 \n\t"
"mulsu %B[DATA], %A[COEF] \n\t"
"add %B[AC], r0 \n\t"
"adc %C[AC], r1 \n\t"
: [AC] "=r" (ac)
: [COEF] "a" (coef),
[DATA] "a" (data)
: "r0", "r1");
return ac;
}
void smac_24(int32_t *ac, int16_t coef, int16_t data) {
asm (
"clr r2 \n\t"
"muls %B[COEF], %B[DATA] \n\t"
"add %C[AC], r0 \n\t"
"mul %A[COEF], %A[DATA] \n\t"
"add %A[AC], r0 \n\t"
"add %B[AC], r1 \n\t"
"adc %C[AC], r2 \n\t"
"mulsu %B[COEF], %A[DATA] \n\t"
"add %B[AC], r0 \n\t"
"adc %C[AC], r1 \n\t"
"mulsu %B[DATA], %A[COEF] \n\t"
"add %B[AC], r0 \n\t"
"adc %C[AC], r1 \n\t"
: "=r" (*ac)
: [COEF] "a" (coef),
[DATA] "a" (data),
[AC] "0" (*ac)
: "r0", "r1", "r2");
}
However, I must be doing something silly, as depending on how I call them and in which order I get completely bullshit results, or even a reset. I have the feeling I'm doing something wrong with input, output and clobbering.
Links to the full code and application note: https://github.com/pepijndevos/accessibletuner/blob/master/tuner/iir.h http://www.microchip.com/wwwappnotes/appnotes.aspx?appnote=en592139
There are at least four obvious bugs in that inline asm code:
Bug: Zero-Reg (R1) must contain 0
avr-gcc assumes that R1 (__zero_reg__
) always contains a value of zero. You may temporarily write a different value into that register, for example by means of MUL
, but after that sequence you must restore the value of zero. See also section Register Layout in the avr-gcc ABI.
Also notice that just clobbering R1 does not do the trick because R1 is a fixed register, i.e. the register allocator won't generate code for it. You can add clobbers to indicate side effects to someone who is reading the code, but you still have to clear R1 by hand at the end of the inline assembly.
Bug: Early-clobber in Inline Assembly
The 1st inline assembly snippet has the following output and input operands:
: [AC] "=r" (ac)
: [COEF] "a" (coef),
[DATA] "a" (data)
Now it is completely fine for the compiler to allocate ac
to registers that may overlap coef
and / or data
because the inputs die at the asm.
If you read the inputs after you started writing to the output(s), then you are going to overwrite and clobber the inputs. Your constraints are only correct when you consumed all the inputs before writing to the output(s), which is not the case for your code.
Correct code is to early-clobber the output:
: [AC] "=&r" (ac)
Bug: ac = 0
is void
You are setting ac = 0
prior to the 1st asm, where you claim that the inline asm is writing all 32 bits of ac
by means of output constraint =r (ac)
. This means that the compiler is allowed to optimize away the initialization of ac
. Even if ac
was initialized, then the high-byte would always be zero, which is a strange result as the output is signed.
What you can do is to use __int24
, or you can set %D[AC]
in the inline asm as needed.
Similar in the 2nd inline asm: You are never setting the high-byte of *ac
, so the high-byte will contain the old value.
Bug: add add adc
is wrong
The sequence
"clr r2" "\n\t"
...
"mul %A[COEF], %A[DATA] \n\t"
"add %A[AC], r0 \n\t"
"add %B[AC], r1 \n\t"
"adc %C[AC], r2 \n\t"
is wrong and sub-optimal. The 2nd add
should be an adc
.
Moreover, the sequence is using R2 just once, hence no need to use an extra (and also expensive because callee-saved) register. You can use CLR
which does not change the Carry flag:
"mul %A[COEF], %A[DATA] \n\t"
"add %A[AC], r0 \n\t"
"adc %B[AC], r1 \n\t" // Was ADD
"clr r1" "\n\t"
"adc %C[AC], r1 \n\t" // No need to use / clobber R2.
Suboptimal Code: Use movw
instead of 2×mov
The 1st asm has
"mov %A[AC], r0 \n\t"
"mov %B[AC], r1 \n\t"
If an AVR device has mul
, then it also has movw
. Hence:
"movw %A[AC], r0 \n\t"
Notice that ac
will always start at an even register number.
Suboptimal Code: Use inline Functions
Using inline function(s) will reduce call overhead and will improve register layout, because there is no need to funnel the values through the registers imposed by the ABI. Moreover, int32_t *ac
means you took the address of some (likely) local variable, which means that variable has to live in the frame. Not a good choice. Hence:
static inline int32_t mul_mov_24 (int16_t coef, int16_t data)
and similar for the 2nd function, maybe even with __attribute__((__always_inline__))
.