I'm currently trying out avr-llvm (an LLVM build that supports AVR as a target). My main goal is to use its hopefully better optimizer (compared to gcc's) to achieve smaller binaries. If you know a little about AVRs, you know that you have only very little memory.
I currently work with an ATtiny45: 4 KB of flash and 256 bytes (just bytes, not KB!) of SRAM.
I tried to compile a simple C program (see below) to check what assembly code is produced and how the machine-code size develops. I used "clang -Oz -S test.c" to produce assembly output optimized for minimal size. My problem is the needlessly saved register values, given that this function can never return.
How can I tell LLVM that it may simply clobber any register it needs without saving/restoring its contents? Any ideas how to optimize this even more (e.g. a more efficient setup of the stack)?
Here is my test program. As mentioned above it was compiled using "clang -Oz -S test.c".
#include <stdint.h>

void __attribute__ ((noreturn)) main() {
    volatile uint8_t res = 1;
    while (1) {}
}
As you can see, it has just one "volatile" variable of type uint8_t (if I didn't declare it volatile, everything would be optimized away). This variable is set to 1, and there is an endless loop at the end. Now let us have a look at the assembly output:
.file "test.c"
.text
.globl main
.align 2
.type main,@function
main:
push r28
push r29
in r28, 61
in r29, 62
sbiw r29:r28, 1
in r0, 63
cli
out 62, r29
out 63, r0
out 61, r28
ldi r24, 1
std Y+1, r24
.BB0_1:
rjmp .BB0_1
.tmp0:
.size main, .tmp0-main
Yeah! That's a lot of machine code for such a simple program. I tested some variations and had a look into the AVR reference manual... so I can explain what happens. Let's have a look at each part.
This here is the "beef", which does just what our C program says: it loads r24 with the value 1, which is then stored into memory at Y+1 (stack pointer + 1). And there is, of course, our endless loop:
ldi r24, 1
std Y+1, r24
.BB0_1:
rjmp .BB0_1
Note that the endless loop is needed; otherwise the __attribute__ ((noreturn)) is ignored, and the stack pointer and the saved registers are restored at the end.
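For reference, this is the variant without the loop that the note refers to; with it the epilogue comes back (as described above: the stack pointer is restored, r28/r29 are popped, and a "ret" is emitted), and the compiler will likely also warn that a 'noreturn' function returns:

#include <stdint.h>

/* Same program, just without the endless loop. With this version the
   full epilogue is emitted again, even though main is marked noreturn. */
void __attribute__ ((noreturn)) main() {
    volatile uint8_t res = 1;
}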
Just before that store-and-loop sequence, the pointer in "Y" is set up:
in r28, 61
in r29, 62
sbiw r29:r28, 1
in r0, 63
cli
out 62, r29
out 63, r0
out 61, r28
What happens here is: the current stack pointer (SPL/SPH at I/O addresses 61 and 62) is read into r28/r29 (the Y register) and decremented by one with sbiw to reserve one byte for the local variable. To write the new value back safely, SREG (I/O address 63) is saved into r0, interrupts are disabled with cli, SPH is written, SREG is restored (which may re-enable interrupts), and finally SPL is written.
This setup of the stack pointer seems inefficient. To reserve a single byte of stack I would just need to "push r0"; then I could simply load the value of SPL/SPH into r28/r29. However, this would probably need some changes to LLVM's optimizer source. The code above only makes sense if more than 3 bytes of stack have to be reserved for local variables (even when optimizing with -O3; with -Oz the push approach makes sense for up to 6 bytes). However, I guess we would need to touch the LLVM source for that, so this is out of scope.
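Just to make the idea concrete, here is a hand-written sketch of that shorter setup, written as a naked function with GNU-style inline assembly (assuming the compiler supports the naked attribute for AVR; this is my own sketch of the idea, not anything clang produces):

#include <stdint.h>

/* One push reserves the single byte of frame, then Y is loaded straight
   from SPL/SPH. No cli/SREG dance is needed, because the stack pointer
   itself is never written directly. */
void __attribute__ ((naked, noreturn)) main() {
    asm volatile(
        "push r0        \n\t"   /* reserve 1 byte of stack (push decrements SP) */
        "in   r28, 0x3d \n\t"   /* SPL -> r28 (Y low)  */
        "in   r29, 0x3e \n\t"   /* SPH -> r29 (Y high) */
        "ldi  r24, 1    \n\t"   /* res = 1 */
        "std  Y+1, r24  \n\t"   /* store it in the frame slot */
        "1: rjmp 1b     \n\t"   /* endless loop */
    );
}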
More interesting is this part:
push r28
push r29
As main() is not intended to return, this doesn't make sense. It just wastes RAM and flash memory on pointless instructions (remember: some devices have only 64, 128 or 256 bytes of SRAM).
I investigated this a bit further: if we let main return (e.g. no endless loop), the stack pointer is restored, there is a "ret" instruction at the end, AND the registers r28 and r29 are restored from the stack via "pop r29 / pop r28". But the compiler should know that if the scope of the function "main" is never left, all registers can be clobbered without storing them to the stack.
This problem may seem a bit "silly" since we are talking about 2 bytes of RAM. But just think about what happens once the program starts using the rest of the registers.
All this really changed my view of current "compilers". I thought that nowadays there wouldn't be much room left for optimization with hand-written assembly. But it seems there is...
So, still the question is...
Do you have any idea how to improve this situation (except for filing a bug report / feature request)?
I mean: Are there just some compiler switches I might have overlooked...?
Using __attribute__ ((OS_main)) works for avr-gcc.
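Applied to the test program from above, it looks roughly like this (the attribute simply goes on the definition of main):

#include <stdint.h>

/* Same test program, just with avr-gcc's OS_main attribute, which tells
   the compiler that main does not have to preserve any call-saved
   registers for a caller. */
void __attribute__ ((OS_main)) main() {
    volatile uint8_t res = 1;
    while (1) {}
}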
The output is as follows:
.file "test.c"
__SREG__ = 0x3f
__SP_H__ = 0x3e
__SP_L__ = 0x3d
__CCP__ = 0x34
__tmp_reg__ = 0
__zero_reg__ = 1
.global __do_copy_data
.global __do_clear_bss
.text
.global main
.type main, @function
main:
push __tmp_reg__
in r28,__SP_L__
in r29,__SP_H__
/* prologue: function */
/* frame size = 1 */
ldi r24,lo8(1)
std Y+1,r24
.L2:
rjmp .L2
.size main, .-main
This is (in my opinion) optimal in size (6 instructions, or 12 bytes) and also in speed for this sample program. Is there an equivalent attribute for LLVM? (clang version '3.2 (trunk 160228) (based on LLVM 3.2svn)' knows about neither OS_task nor OS_main.)
The answer to the question asked is somewhat brought up by Anton in his comment: the problem is not in LLVM, it is in your AVR target. For example, here is an equivalent program run through Clang and LLVM for other targets:
% cat test.c
__attribute__((noreturn)) int main() {
volatile unsigned char res = 1;
while (1) {}
}
% ./bin/clang -c -o - -S -Oz test.c # I'm on an x86-64 machine
<snip>
main: # @main
.cfi_startproc
# BB#0: # %entry
movb $1, -1(%rsp)
.LBB0_1: # %while.body
# =>This Inner Loop Header: Depth=1
jmp .LBB0_1
.Ltmp0:
.size main, .Ltmp0-main
.cfi_endproc
% ./bin/clang -c -o - --target=armv6-unknown-linux-gnueabi -S -Oz test.c
<snip>
main:
sub sp, sp, #4
mov r0, #1
strb r0, [sp, #3]
.LBB0_1:
b .LBB0_1
.Ltmp0:
.size main, .Ltmp0-main
% ./bin/clang -c -o - --target=powerpc64-unknown-linux-gnu -S -Oz test.c
<snip>
main:
.align 3
.quad .L.main
.quad .TOC.@tocbase
.quad 0
.text
.L.main:
li 3, 1
stb 3, -9(1)
.LBB0_1:
b .LBB0_1
.long 0
.quad 0
.Ltmp0:
.size main, .Ltmp0-.L.main
As you can see, for all three of these targets the only code generated is to reserve stack space (if necessary; it isn't on x86-64) and to set the value on the stack. I think this is minimal.
That said, if you do find problems with LLVM's optimizer, the best way to get help is to send email to the development mailing list or to file bugs if you have a specific input IR sequence that should produce more minimal output IR.
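For example, to capture the textual IR for your test program that you could attach to such a report, something like this should work (same style of invocation as above):

% ./bin/clang -S -emit-llvm -Oz test.c -o test.ll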
Finally, to answer the questions asked in the comments on your question: there are actually areas where LLVM's optimizer is significantly more powerful than GCC's. However, there are also areas where it is significantly less powerful. =] Benchmark the code you care about.