c++ arduino compiler-optimization avr avr-gcc

avr-gcc: (seemingly) unneeded prologue/epilogue in simple function

When trying to address individual bytes inside a uint64_t, avr-gcc⁽¹⁾ emits a strange prologue/epilogue, while the same function written with uint32_t compiles to a single ret (the example function is a no-op: it returns its argument unchanged).

Why does gcc do this? How do I remove this?

You can see the code here, in Compiler Explorer.

⁽¹⁾ gcc 5.4.0 from the Arduino 1.8.9 distribution, with flags -O3 -std=c++11.

Source code:

#include <stdint.h>

uint32_t f_u32(uint32_t x) {
  union y {
    uint8_t p[4];
    uint32_t w;
  };
  return y{ .p = {
    y{ .w = x }.p[0],
    y{ .w = x }.p[1],
    y{ .w = x }.p[2],
    y{ .w = x }.p[3]
  } }.w;
}

uint64_t f_u64(uint64_t x) {
  union y {
    uint8_t p[8];
    uint64_t w;
  };
  return y{ .p = {
    y{ .w = x }.p[0],
    y{ .w = x }.p[1],
    y{ .w = x }.p[2],
    y{ .w = x }.p[3],
    y{ .w = x }.p[4],
    y{ .w = x }.p[5],
    y{ .w = x }.p[6],
    y{ .w = x }.p[7]
  } }.w;
}

Generated assembly for the uint32_t version:

f_u32(unsigned long):
  ret

Generated assembly for the uint64_t version:

f_u64(unsigned long long):
  push r28
  push r29
  in r28,__SP_L__
  in r29,__SP_H__
  subi r28,72
  sbc r29,__zero_reg__
  in __tmp_reg__,__SREG__
  cli
  out __SP_H__,r29
  out __SREG__,__tmp_reg__
  out __SP_L__,r28
  subi r28,-72
  sbci r29,-1
  in __tmp_reg__,__SREG__
  cli
  out __SP_H__,r29
  out __SREG__,__tmp_reg__
  out __SP_L__,r28
  pop r29
  pop r28
  ret

Solution

I am not sure this is a good answer, but it is the best I can give. The assembly for f_u64() allocates 72 bytes on the stack and then immediately deallocates them. Because the stack-pointer arithmetic is done in r28 and r29 (the Y register pair, which avr-gcc uses as the frame pointer), those two registers are saved at the start and restored at the end.

If you compile without optimization (I also dropped the -std=c++11 flag; I do not think it makes any difference), f_u64() starts by allocating 80 bytes on the stack. The sequence is the same as the opening instructions of the optimized code, only with 80 bytes instead of 72:

    in r28,__SP_L__
    in r29,__SP_H__
    subi r28,80
    sbc r29,__zero_reg__
    in __tmp_reg__,__SREG__
    cli
    out __SP_H__,r29
    out __SREG__,__tmp_reg__
    out __SP_L__,r28

All 80 of these bytes are actually used. First the argument x is stored (8 bytes), and then the remaining 72 bytes are used for a lot of copying of data back and forth.

After that, the 80 bytes are deallocated again, just like in the closing instructions of the optimized code:

    subi r28,-80
    sbci r29,-1
    in __tmp_reg__,__SREG__
    cli
    out __SP_H__,r29
    out __SREG__,__tmp_reg__
    out __SP_L__,r28
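
The two frame sizes are consistent with a simple count of the objects in the source, although this accounting is only an assumption and is not confirmed by the compiler output: eight y{ .w = x } temporaries plus one result union give 9 × 8 = 72 bytes, and a stored copy of the argument adds another 8. The constant names below are hypothetical and exist only for this illustration.

```cpp
#include <stddef.h>
#include <stdint.h>

// Same union as in f_u64() from the question.
union y {
  uint8_t p[8];
  uint64_t w;
};

// ASSUMPTION: one stack slot per union object that appears in the source.
constexpr size_t kTemporaries = 8;  // one y{ .w = x } per array element
constexpr size_t kResult = 1;       // the outer y{ .p = {...} }

// Frame size seen with -O3: only the union objects remain.
constexpr size_t kFrameO3 = (kTemporaries + kResult) * sizeof(y);

// Frame size seen without optimization: the argument x is stored as well.
constexpr size_t kFrameO0 = kFrameO3 + sizeof(uint64_t);
```

If this count is right, the optimized build drops only the argument copy and keeps a slot for every union object, even though none of them is written any more.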

My guess is that the optimizer first concludes that the 8 bytes for storing the argument are unnecessary, which leaves 72 bytes. It then concludes that all the copying is unnecessary as well. However, it fails to notice that, with the copying gone, the 72-byte stack frame is no longer needed either.

Hence my best bet is that this is a limitation or a bug in the optimizer (whichever you prefer to call it). In that case the only "solution" is to restructure the real code until you find a work-around, or to report it as a bug against the compiler.
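
One restructuring that may be worth trying is to extract and recombine the bytes with shifts, so that no union object has to live in memory. This is a minimal sketch under assumptions: the names byte_of, put_byte and f_u64_shift are hypothetical, and it has not been verified that avr-gcc 5.4.0 reduces it to a bare ret (64-bit shifts on AVR may turn into library calls), so inspect the generated assembly before relying on it. AVR is little-endian, so p[i] in the union corresponds to bits 8*i to 8*i+7.

```cpp
#include <stdint.h>

// Hypothetical helper: byte i of x, where i = 0 is the least significant
// byte (the same byte as p[0] in the union on little-endian AVR).
constexpr uint8_t byte_of(uint64_t x, unsigned i) {
  return static_cast<uint8_t>(x >> (8 * i));
}

// Hypothetical helper: move byte b into position i of a 64-bit word.
constexpr uint64_t put_byte(uint8_t b, unsigned i) {
  return static_cast<uint64_t>(b) << (8 * i);
}

// Same no-op as f_u64() in the question, written without unions.
constexpr uint64_t f_u64_shift(uint64_t x) {
  return put_byte(byte_of(x, 0), 0) | put_byte(byte_of(x, 1), 1) |
         put_byte(byte_of(x, 2), 2) | put_byte(byte_of(x, 3), 3) |
         put_byte(byte_of(x, 4), 4) | put_byte(byte_of(x, 5), 5) |
         put_byte(byte_of(x, 6), 6) | put_byte(byte_of(x, 7), 7);
}
```

Compiling this with the same flags as the original (-O3 -std=c++11) and comparing the output against the listing for f_u64() shows whether the stack frame disappears for your compiler version.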