Search code examples
gccsignal-processingavrinline-assemblyavr-gcc

How can I port IAR AVR assembly smac_24 to GCC?


I'm trying to use the IIR filter from application note AVR223, which is written in the IAR dialect of AVR assembly. The main reason being that in C you can only do full 32*32=32 bits multiplications. So I'm especially interested in the following macros:

MUL_MOV_24 MACRO
 // Multiply (signed) high bytes of both coefficient and sample work registers.
 // Then copy the resulting low byte into the accumulator's high byte
 // (sign bit will be correct).
    muls  COEFH, DATAH
    mov   AC2,   R0

 // Multiply (unsigned) low bytes of the coefficient and sample work registers.
 // Copy the resulting high & low byte to the accumulator's low & middle bytes.
    mul  COEFL, DATAL
    mov  AC0,   R0
    mov  AC1,   R1

 // Multiply (signed-unsigned) high coefficient byte and low sample byte.
 // Add resulting low byte to accumulator's middle byte. (May generate carry!)
 // Add, with carry, resulting high byte to accumulator's high byte.
    mulsu  COEFH, DATAL
    add    AC1,   R0
    adc    AC2,   R1

 // Multiply (signed-unsigned) high sample byte and low coefficient byte.
 // Add resulting low byte to accumulator's middle byte. (May generate carry!)
 // Add, with carry, resulting high byte to accumulator's high byte.
    mulsu  DATAH, COEFL
    add    AC1,   R0
    adc    AC2,   R1
    ENDM


// SMAC_24 does the same thing as MUL_MOV_24 except it adds to the accumulator
// from the start, instead of overwriting.
SMAC_24 MACRO
    muls  COEFH, DATAH
    add   AC2,   R0

    mul   COEFL, DATAL
    add   AC0,   R0
    adc   AC1,   R1    // This may generate a carry..
    adc   AC2,   ZERO  // ..which must be added to accumulator's high byte!

    mulsu  COEFH, DATAL
    add    AC1,   R0
    adc    AC2,   R1

    mulsu  DATAH, COEFL
    add    AC1,   R0
    adc    AC2,   R1
    ENDM

Which I tried to convert to the following functions:

int32_t mul_mov_24(int16_t coef, int16_t data) {
  int32_t ac = 0;
  asm (
  "muls  %B[COEF], %B[DATA] \n\t"
  "mov   %C[AC],   r0 \n\t"

  "mul  %A[COEF], %A[DATA] \n\t"
  "mov  %A[AC],   r0 \n\t"
  "mov  %B[AC],   r1 \n\t"

  "mulsu  %B[COEF], %A[DATA] \n\t"
  "add    %B[AC],   r0 \n\t"
  "adc    %C[AC],   r1 \n\t"

  "mulsu  %B[DATA], %A[COEF] \n\t"
  "add    %B[AC],   r0 \n\t"
  "adc    %C[AC],   r1 \n\t"
  : [AC] "=r" (ac)
  : [COEF] "a" (coef),
    [DATA] "a" (data)
  : "r0", "r1");

  return ac;
}

void smac_24(int32_t *ac, int16_t coef, int16_t data) {
  asm (
  "clr r2 \n\t"
  "muls  %B[COEF], %B[DATA] \n\t"
  "add   %C[AC],   r0 \n\t"

  "mul  %A[COEF], %A[DATA] \n\t"
  "add  %A[AC],   r0 \n\t"
  "add  %B[AC],   r1 \n\t"
  "adc  %C[AC],   r2 \n\t"

  "mulsu  %B[COEF], %A[DATA] \n\t"
  "add    %B[AC],   r0 \n\t"
  "adc    %C[AC],   r1 \n\t"

  "mulsu  %B[DATA], %A[COEF] \n\t"
  "add    %B[AC],   r0 \n\t"
  "adc    %C[AC],   r1 \n\t"
  : "=r" (*ac)
  : [COEF] "a" (coef),
    [DATA] "a" (data),
    [AC] "0" (*ac)
  : "r0", "r1", "r2");
}

However, I must be doing something silly, as depending on how I call them and in which order I get completely bullshit results, or even a reset. I have the feeling I'm doing something wrong with input, output and clobbering.

Links to the full code and application note: https://github.com/pepijndevos/accessibletuner/blob/master/tuner/iir.h http://www.microchip.com/wwwappnotes/appnotes.aspx?appnote=en592139


Solution

  • There are at least four obvious bugs in that inline asm code:

    Bug: Zero-Reg (R1) must contain 0

    avr-gcc assumes that R1 (__zero_reg__) always contains a value of zero. You may temporarily write a different value into that register, for example by means of MUL, but after that sequence you must restore the value of zero. See also section Register Layout in the avr-gcc ABI.

    Also notice that just clobbering R1 does not do the trick because R1 is a fixed register, i.e. the register allocator won't generate code for it. You can add clobbers to indicate side effects to someone who is reading the code, but you still have to clear R1 by hand at the end of the inline assembly.

    Bug: Early-clobber in Inline Assembly

    The 1st inline assembly snippet has the following output and input operands:

      : [AC] "=r" (ac)
      : [COEF] "a" (coef),
        [DATA] "a" (data)
    

    Now it is completely fine for the compiler to allocate ac to registers that may overlap coef and / or data because the inputs die at the asm.

    If you read the inputs after you started writing to the output(s), then you are going to overwrite and clobber the inputs. Your constraints are only correct when you consumed all the inputs before writing to the output(s), which is not the case for your code.

    Correct code is to early-clobber the output:

      : [AC] "=&r" (ac)
    

    Bug: ac = 0 is void

    You are setting ac = 0 prior to the 1st asm, where you claim that the inline asm is writing all 32 bits of ac by means of output constraint =r (ac). This means that the compiler is allowed to optimize away the initialization of ac. Even if ac was initialized, then the high-byte would always be zero, which is a strange result as the output is signed.

    What you can do is to use __int24, or you can set %D[AC] in the inline asm as needed.

    Similar in the 2nd inline asm: You are never setting the high-byte of *ac, so the high-byte will contain the old value.

    Bug: add add adc is wrong

    The sequence

      "clr r2"          "\n\t"
      ...
      "mul  %A[COEF], %A[DATA] \n\t"
      "add  %A[AC],   r0 \n\t"
      "add  %B[AC],   r1 \n\t"
      "adc  %C[AC],   r2 \n\t"
    

    is wrong and sub-optimal. The 2nd add should be an adc.

    Moreover, the sequence is using R2 just once, hence no need to use an extra (and also expensive because callee-saved) register. You can use CLR which does not change the Carry flag:

      "mul  %A[COEF], %A[DATA] \n\t"
      "add  %A[AC],   r0 \n\t"
      "adc  %B[AC],   r1 \n\t" // Was ADD
      "clr  r1"         "\n\t"
      "adc  %C[AC],   r1 \n\t" // No need to use / clobber R2.
    

    Suboptimal Code: Use movw instead of 2×mov

    The 1st asm has

      "mov  %A[AC],   r0 \n\t"
      "mov  %B[AC],   r1 \n\t"
    

    If an AVR device has mul, then it also has movw. Hence:

      "movw %A[AC],   r0 \n\t"
    

    Notice that ac will always start at an even register number.

    Suboptimal Code: Use inline Functions

    Using inline function(s) will reduce call overhead and will improve register layout, because there is no need to funnel the values through the registers imposed by the ABI. Moreover, int32_t *ac means you took the address of some (likely) local variable, which means that variable has to live in the frame. Not a good choice. Hence:

    static inline int32_t mul_mov_24 (int16_t coef, int16_t data) 
    

    and similar for the 2nd function, maybe even with __attribute__((__always_inline__)).