I'm currently trying out avr-llvm (an LLVM build that supports AVR as a target). My main goal is to use its hopefully better optimizer (compared to gcc's) to achieve smaller binaries. If you know a little about AVRs, you know that you have only very little memory.
I currently work with an ATtiny45: 4 KB of flash and 256 bytes (just bytes, not KB!) of SRAM.
I tried to compile a simple C program (see below) to check what assembly code is produced and how the machine-code size develops. I used "clang -Oz -S test.c" to produce assembly output optimized for minimal size. My problem is the needlessly saved register values, given that this function can never return.
How can I tell LLVM that it may simply clobber any register it needs without saving/restoring its contents? Any ideas how to optimize this even more (e.g. a more efficient setup of the stack)?
Here is my test program. As mentioned above it was compiled using "clang -Oz -S test.c".
#include <stdint.h>

void __attribute__ ((noreturn)) main() {
    volatile uint8_t res = 1;
    while (1) {}
}
As you can see, it has just one "volatile" variable of type uint8_t (if I didn't declare it volatile, everything would be optimized away). This variable is set to 1, and there is an endless loop at the end. Now let us have a look at the assembly output:
.file "test.c"
.text
.globl main
.align 2
.type main,@function
main:
push r28
push r29
in r28, 61
in r29, 62
sbiw r29:r28, 1
in r0, 63
cli
out 62, r29
out 63, r0
out 61, r28
ldi r24, 1
std Y+1, r24
.BB0_1:
rjmp .BB0_1
.tmp0:
.size main, .tmp0-main
Yeah! That's a lot of machine code for such a simple program. I tested some variations and had a look into the AVR reference manual... so I can explain what happens. Let's have a look at each part.
This here is the "beef", which does just what our C program says: it loads r24 with the value 1, which is then stored into memory at Y+1 (stack pointer + 1). And there is, of course, our endless loop:
ldi r24, 1
std Y+1, r24
.BB0_1:
rjmp .BB0_1
Note that the endless loop is needed; otherwise the __attribute__ ((noreturn)) is ignored, and the stack pointer and the saved registers are restored at the end.
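For reference, this is the variant without the loop that the note refers to; with it the epilogue comes back (as described above: the stack pointer is restored, r28/r29 are popped, and a "ret" is emitted), and the compiler will likely also warn that a 'noreturn' function returns:

#include <stdint.h>

/* Same program, just without the endless loop. With this version the
   full epilogue is emitted again, even though main is marked noreturn. */
void __attribute__ ((noreturn)) main() {
    volatile uint8_t res = 1;
}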
Just before that store-and-loop sequence, the pointer in "Y" is set up:
in r28, 61
in r29, 62
sbiw r29:r28, 1
in r0, 63
cli
out 62, r29
out 63, r0
out 61, r28
What happens here is: the current stack pointer (SPL/SPH at I/O addresses 61 and 62) is read into r28/r29 (the Y register) and decremented by one with sbiw to reserve one byte for the local variable. To write the new value back safely, SREG (I/O address 63) is saved into r0, interrupts are disabled with cli, SPH is written, SREG is restored (which may re-enable interrupts), and finally SPL is written.
This setup of the stack pointer seems inefficient. To reserve a single byte of stack I would just need to "push r0"; then I could simply load the value of SPL/SPH into r28/r29. However, this would probably need some changes to LLVM's optimizer source. The code above only makes sense if more than 3 bytes of stack have to be reserved for local variables (even when optimizing with -O3; with -Oz the push approach makes sense for up to 6 bytes). However, I guess we would need to touch the LLVM source for that, so this is out of scope.
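Just to make the idea concrete, here is a hand-written sketch of that shorter setup, written as a naked function with GNU-style inline assembly (assuming the compiler supports the naked attribute for AVR; this is my own sketch of the idea, not anything clang produces):

#include <stdint.h>

/* One push reserves the single byte of frame, then Y is loaded straight
   from SPL/SPH. No cli/SREG dance is needed, because the stack pointer
   itself is never written directly. */
void __attribute__ ((naked, noreturn)) main() {
    asm volatile(
        "push r0        \n\t"   /* reserve 1 byte of stack (push decrements SP) */
        "in   r28, 0x3d \n\t"   /* SPL -> r28 (Y low)  */
        "in   r29, 0x3e \n\t"   /* SPH -> r29 (Y high) */
        "ldi  r24, 1    \n\t"   /* res = 1 */
        "std  Y+1, r24  \n\t"   /* store it in the frame slot */
        "1: rjmp 1b     \n\t"   /* endless loop */
    );
}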
More interesting is this part:
push r28
push r29
As main() is not intended to return, this doesn't make sense. It just wastes RAM and flash memory on pointless instructions (remember: some devices have only 64, 128 or 256 bytes of SRAM).
I investigated this a bit further: if we let main return (e.g. no endless loop), the stack pointer is restored, there is a "ret" instruction at the end, AND the registers r28 and r29 are restored from the stack via "pop r29 / pop r28". But the compiler should know that if the scope of the function "main" is never left, all registers can be clobbered without storing them to the stack.
This problem may seem a bit "silly" since we are talking about 2 bytes of RAM. But just think about what happens once the program starts using the rest of the registers.
All this really changed my view of current "compilers". I thought that nowadays there wouldn't be much room left for optimization with hand-written assembly. But it seems there is...
So, still the question is...
Do you have any idea how to improve this situation (except for filing a bug report / feature request)?
I mean: Are there just some compiler switches I might have overlooked...?
Using __attribute__ ((OS_main)) works for avr-gcc.
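Applied to the test program from above, it looks roughly like this (the attribute simply goes on the definition of main):

#include <stdint.h>

/* Same test program, just with avr-gcc's OS_main attribute, which tells
   the compiler that main does not have to preserve any call-saved
   registers for a caller. */
void __attribute__ ((OS_main)) main() {
    volatile uint8_t res = 1;
    while (1) {}
}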
The output is as follows:
.file "test.c"
__SREG__ = 0x3f
__SP_H__ = 0x3e
__SP_L__ = 0x3d
__CCP__ = 0x34
__tmp_reg__ = 0
__zero_reg__ = 1
.global __do_copy_data
.global __do_clear_bss
.text
.global main
.type main, @function
main:
push __tmp_reg__
in r28,__SP_L__
in r29,__SP_H__
/* prologue: function */
/* frame size = 1 */
ldi r24,lo8(1)
std Y+1,r24
.L2:
rjmp .L2
.size main, .-main
This is (in my opinion) optimal in size (6 instructions, or 12 bytes) and also in speed for this sample program. Is there an equivalent attribute for LLVM? (clang version '3.2 (trunk 160228) (based on LLVM 3.2svn)' knows about neither OS_task nor OS_main.)
The answer to the question asked is somewhat brought up by Anton in his comment: the problem is not in LLVM, it is in your AVR target. For example, here is an equivalent program run through Clang and LLVM for other targets:
% cat test.c
__attribute__((noreturn)) int main() {
volatile unsigned char res = 1;
while (1) {}
}
% ./bin/clang -c -o - -S -Oz test.c # I'm on an x86-64 machine
<snip>
main: # @main
.cfi_startproc
# BB#0: # %entry
movb $1, -1(%rsp)
.LBB0_1: # %while.body
# =>This Inner Loop Header: Depth=1
jmp .LBB0_1
.Ltmp0:
.size main, .Ltmp0-main
.cfi_endproc
% ./bin/clang -c -o - --target=armv6-unknown-linux-gnueabi -S -Oz test.c
<snip>
main:
sub sp, sp, #4
mov r0, #1
strb r0, [sp, #3]
.LBB0_1:
b .LBB0_1
.Ltmp0:
.size main, .Ltmp0-main
% ./bin/clang -c -o - --target=powerpc64-unknown-linux-gnu -S -Oz test.c
<snip>
main:
.align 3
.quad .L.main
.quad .TOC.@tocbase
.quad 0
.text
.L.main:
li 3, 1
stb 3, -9(1)
.LBB0_1:
b .LBB0_1
.long 0
.quad 0
.Ltmp0:
.size main, .Ltmp0-.L.main
As you can see, for all three of these targets the only code generated is to reserve stack space (if necessary; it isn't on x86-64) and to set the value on the stack. I think this is minimal.
That said, if you do find problems with LLVM's optimizer, the best way to get help is to send email to the development mailing list or to file bugs if you have a specific input IR sequence that should produce more minimal output IR.
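For example, to capture the textual IR for your test program that you could attach to such a report, something like this should work (same style of invocation as above):

% ./bin/clang -S -emit-llvm -Oz test.c -o test.ll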
Finally, to answer the questions asked in the comments on your question: there are actually areas where LLVM's optimizer is significantly more powerful than GCC's. However, there are also areas where it is significantly less powerful. =] Benchmark the code you care about.