Why does the AVR-GCC compiler append a "clr r1" instruction after multiplication?

I am trying to see how the AVR-GCC compiler compiles a multiplication.

Input C code:

unsigned char square(unsigned char num) {
    return num * num;
}

Output assembly code:

square(unsigned char):
        mul r24,r24
        mov r24,r0
        clr r1
        ret

My question is why the compiler adds the clr r1 instruction. Seemingly, one could remove it and still get the desired result, given that the parameter is passed in r24 and the return value is expected in r24.

Direct Godbolt link: https://godbolt.org/z/PsPS_N

UPDATE:

I also found a related, more general discussion here.

Solution

When GCC's AVR backend was implemented and the avr-gcc ABI was devised, it turned out that code generation can be improved in some situations when there is a register that is known to contain 0. The author chose R1 back then, i.e. when avr-gcc is printing assembly instructions, one may assume that R1=0 like in this example:

unsigned add (unsigned x, unsigned char y)
{
    if (x != 64)
        return x + y;
    else
        return x;
}

This compiles with -c -Os -save-temps to the code below. It uses R1, a.k.a. __zero_reg__, so it can print a shorter instruction sequence:

__zero_reg__ = 1
add:
    cpi r24,64
    cpc r25,__zero_reg__
    breq .L2
    add r24,r22
    adc r25,__zero_reg__
.L2:
    ret

R1 was chosen because on AVR the upper registers are more powerful (for example, instructions with immediate operands such as CPI or LDI only accept R16 to R31), and therefore register allocation starts, roughly speaking, at the higher registers, so the low registers are used last. Thus a register with a small register number was chosen.

This special register is not managed by the register allocator; it is "fixed" and managed by hand. This was all simple with the early AVRs, which did not support MUL instructions. With the introduction of MUL and its cousins, however, things got more complicated because MUL uses the register pair R1:R0 as its implicit output register and hence overwrites the 0 held in __zero_reg__.

Thus there are two possible approaches:

1. Emit CLR __zero_reg__ prior to each use so R1 contains 0.
2. Clear that register after a sequence that clobbered it.

The avr backend implements approach 2.
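
The same convention applies to hand-written inline assembly: code that uses MUL is expected to leave R1 cleared when it finishes. The following is a minimal sketch, assuming an AVR device with a hardware multiplier. The helper name mul8_lo, the check function, and the non-AVR fallback branch are made up for illustration so that the snippet also builds with a host compiler:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: low byte of an 8x8 multiplication. */
static inline uint8_t mul8_lo(uint8_t a, uint8_t b)
{
#ifdef __AVR__
    uint8_t result;
    __asm__ ("mul %1,%2"        "\n\t"  /* product lands in R1:R0 */
             "mov %0,r0"        "\n\t"  /* keep the low byte */
             "clr __zero_reg__"         /* approach 2: restore R1 = 0 */
             : "=r" (result)
             : "r" (a), "r" (b));
    return result;
#else
    /* Host fallback (an assumption for testing, not AVR code). */
    return (uint8_t)(a * b);
#endif
}

/* Quick self-check of the arithmetic. */
void check_mul8_lo(void)
{
    assert(mul8_lo(3, 5) == 15);
    assert(mul8_lo(16, 17) == 16);  /* 272 mod 256 */
}
```

Without the final clr, any later compiler-generated instruction that reads __zero_reg__ would see the high byte of the product instead of 0.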

Because in the current avr backend (at least up to v10) this register is managed by hand, there is no information whether clearing that register is actually needed or might be omitted:

unsigned char mul (unsigned char x)
{
    return x * x * x;
}

produces with -c -Os -mmcu=atmega8 -save-temps:

mul:
    mul r24,r24
    mov r25,r0
    clr r1
    mul r25,r24
    mov r24,r0
    clr r1
    ret

i.e. R1 is cleared twice, even though the second MUL overwrites it again right after the first CLR. In principle, the avr backend could track which instructions clobber R1 and which instructions or instruction sequences require R1=0; however, this is currently (v10) not implemented.
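
To illustrate the kind of tracking meant here, the following is a toy model only and not how GCC represents instructions. Each instruction of a straight-line sequence is tagged by its effect on R1, and a CLR counts as redundant when R1 is overwritten again before anything relies on R1=0. All names (r1_effect, count_redundant_clears, check_cube_example) are made up for this sketch, and RET is tagged as needing zero on the assumption that the caller expects R1=0 on return:

```c
#include <assert.h>
#include <stddef.h>

enum r1_effect {
    R1_NONE,        /* neither touches nor relies on R1 */
    R1_CLOBBER,     /* overwrites R1, e.g. MUL */
    R1_NEEDS_ZERO,  /* relies on R1 == 0, e.g. ADC Rd,__zero_reg__ or RET */
    R1_CLEAR        /* CLR R1 */
};

/* Count CLRs whose zero is overwritten before anything uses it. */
size_t count_redundant_clears(const enum r1_effect *seq, size_t n)
{
    size_t redundant = 0;
    for (size_t i = 0; i < n; ++i) {
        if (seq[i] != R1_CLEAR)
            continue;
        for (size_t j = i + 1; j < n; ++j) {
            if (seq[j] == R1_NEEDS_ZERO)
                break;          /* the zero is consumed: CLR is needed */
            if (seq[j] == R1_CLOBBER || seq[j] == R1_CLEAR) {
                ++redundant;    /* overwritten before any use */
                break;
            }
        }
        /* Reaching the end of the sequence keeps the CLR. */
    }
    return redundant;
}

/* The x * x * x sequence: mul, mov, clr, mul, mov, clr, ret. */
void check_cube_example(void)
{
    static const enum r1_effect cube[] = {
        R1_CLOBBER, R1_NONE, R1_CLEAR,
        R1_CLOBBER, R1_NONE, R1_CLEAR,
        R1_NEEDS_ZERO
    };
    assert(count_redundant_clears(cube, sizeof cube / sizeof cube[0]) == 1);
}
```

In this model only the first CLR of the x * x * x sequence is reported as redundant; the second one is kept because RET follows it.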

The introduction of MUL led to yet another complication: R1 is no longer always zero. When an interrupt triggers right after a MUL, the register is in general not zero. Thus an interrupt service routine (ISR) must save and restore R1 whenever it might use it:

#include <avr/interrupt.h>

char volatile v;

ISR (__vector_1)
{
    v = 0;
}

Compiling, assembling and then avr-objdump -d on the object file reads:

00000000 <__vector_1>:
   0:   1f 92           push    r1
   2:   1f b6           in      r1, 0x3f
   4:   1f 92           push    r1
   6:   11 24           eor     r1, r1
   8:   10 92 00 00     sts     0x0000, r1
   c:   1f 90           pop     r1
   e:   1f be           out     0x3f, r1
  10:   1f 90           pop     r1
  12:   18 95           reti

The payload of the ISR is just sts ..., r1, which stores 0 to v. This requires R1=0, hence the clr r1 (shown by the disassembler as eor r1, r1), and hence R1 must be saved and restored by means of push and pop. The clr also clobbers the program status register (SREG at I/O address 0x3f), so SREG must be saved and restored around that sequence as well. To accomplish that, the compiler uses r1 as a scratch register, because special function registers cannot be used with push/pop directly.

Apart from that, there are situations where there is no reset of zero-reg after a MUL:

int square (int a)
{
    return a * a;
}

compiles to:

    mul  r24,r24
    movw r18,r0
    mul  r24,r25
    add  r19,r0
    add  r19,r0
    clr  r1
    movw r24,r18
    ret

The reason there is no CLR after the first MUL is that the multiplication sequence is represented internally, and then emitted, as one chunk (insn), so it is known that no intermediate CLR is needed. In the x * x * x example above, however, the internal representation consists of two insns, one for each multiplication.
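
For reference, the arithmetic behind the 16-bit sequence can be modelled in portable C. With a = 256*hi + lo, the hi*hi term is shifted out of the 16-bit result, and the cross term lo*hi contributes twice to the high byte, which is why add r19,r0 appears twice. This is only a host-side sketch; the names square16_model and check_square16_model are made up, and the local variables merely mirror the registers of the listing:

```c
#include <assert.h>
#include <stdint.h>

/* Host-side model of the AVR sequence for a 16-bit square. */
uint16_t square16_model(uint16_t a)
{
    uint8_t lo = (uint8_t)(a & 0xFF);        /* r24 */
    uint8_t hi = (uint8_t)(a >> 8);          /* r25 */

    uint16_t p = (uint16_t)(lo * lo);        /* mul r24,r24 -> R1:R0 */
    uint8_t r18 = (uint8_t)(p & 0xFF);       /* movw r18,r0 (low byte) */
    uint8_t r19 = (uint8_t)(p >> 8);         /* movw r18,r0 (high byte) */

    uint16_t cross = (uint16_t)(lo * hi);    /* mul r24,r25 -> R1:R0 */
    uint8_t r0 = (uint8_t)(cross & 0xFF);    /* only the low byte is used */

    r19 = (uint8_t)(r19 + r0);               /* add r19,r0 */
    r19 = (uint8_t)(r19 + r0);               /* add r19,r0 */

    return (uint16_t)((r19 << 8) | r18);     /* movw r24,r18 */
}

/* Exhaustive comparison against plain C arithmetic modulo 2^16. */
void check_square16_model(void)
{
    for (uint32_t a = 0; a <= 0xFFFF; ++a)
        assert(square16_model((uint16_t)a) == (uint16_t)(a * a));
}
```

R1 is overwritten by both multiplications and is never needed as zero in between, which matches the single clr r1 at the end of the listing.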