An empty loop in C: does the compiler generate lots of unnecessary code, or did I miss something?

I am new to AVR assembly language and decided to look inside the code of a dumb delay function written in C to see how long an empty loop with long (32-bit) arithmetic could take.

The delay function is as follows:

void delay(uint32_t cycles) {
    for (volatile uint32_t i = 0; i < cycles; i++) {}
}

I disassembled it with objdump and, I think, got some strange results (see four questions in the comments):

00000080 <delay>:
void delay (uint32_t cycles) {                  
; `cycles` is stored in r22..r25
  80:   cf 93           push    r28
  82:   df 93           push    r29
; First one: why does the compiler rcall the next position relative to the following
; two instructions? Some stack management?
  84:   00 d0           rcall   .+0             ; 0x86 <delay+0x6>
  86:   00 d0           rcall   .+0             ; 0x88 <delay+0x8>
  88:   cd b7           in      r28, 0x3d       ; 61
  8a:   de b7           in      r29, 0x3e       ; 62
  8c:   ab 01           movw    r20, r22
  8e:   bc 01           movw    r22, r24
; Now `cycles` is in r20..r23
    for (volatile uint32_t i = 0; i < cycles; i++) {}
; r1 was earlier initialized with zero by `eor r1, r1`
; `i` is in r24..r27
  90:   19 82           std     Y+1, r1 ; 0x01
  92:   1a 82           std     Y+2, r1 ; 0x02
  94:   1b 82           std     Y+3, r1 ; 0x03
  96:   1c 82           std     Y+4, r1 ; 0x04
  98:   89 81           ldd     r24, Y+1        ; 0x01
  9a:   9a 81           ldd     r25, Y+2        ; 0x02
  9c:   ab 81           ldd     r26, Y+3        ; 0x03
  9e:   bc 81           ldd     r27, Y+4        ; 0x04
  a0:   84 17           cp      r24, r20
  a2:   95 07           cpc     r25, r21
  a4:   a6 07           cpc     r26, r22
  a6:   b7 07           cpc     r27, r23
  a8:   a0 f4           brcc    .+40            ; 0xd2 <delay+0x52>, to location A
; location B:
; Third (yes, before the second) one: why does it load the registers each time after
; comparing the counter with the limit if `cp`, `cpc` do not change the registers?
  aa:   89 81           ldd     r24, Y+1        ; 0x01
  ac:   9a 81           ldd     r25, Y+2        ; 0x02
  ae:   ab 81           ldd     r26, Y+3        ; 0x03
  b0:   bc 81           ldd     r27, Y+4        ; 0x04
  b2:   01 96           adiw    r24, 0x01       ; 1
  b4:   a1 1d           adc     r26, r1
  b6:   b1 1d           adc     r27, r1
; Second one: why does it store and load the same registers with unchanged values?
; If it needs to store the registers, why does it load anyway? Does `std` change the
; source registers?
  b8:   89 83           std     Y+1, r24        ; 0x01
  ba:   9a 83           std     Y+2, r25        ; 0x02
  bc:   ab 83           std     Y+3, r26        ; 0x03
  be:   bc 83           std     Y+4, r27        ; 0x04
  c0:   89 81           ldd     r24, Y+1        ; 0x01
  c2:   9a 81           ldd     r25, Y+2        ; 0x02
  c4:   ab 81           ldd     r26, Y+3        ; 0x03
  c6:   bc 81           ldd     r27, Y+4        ; 0x04
  c8:   84 17           cp      r24, r20
  ca:   95 07           cpc     r25, r21
  cc:   a6 07           cpc     r26, r22
  ce:   b7 07           cpc     r27, r23
  d0:   60 f3           brcs    .-40            ; 0xaa <delay+0x2a>, to location B
}
; Location A:
; Finally, fourth one: so, under my first question it issued an `rcall` twice and now 
; just pops the return addresses to nowhere? Now the `rcall`s are double-strange
  d2:   0f 90           pop     r0
  d4:   0f 90           pop     r0
  d6:   0f 90           pop     r0
  d8:   0f 90           pop     r0
  da:   df 91           pop     r29
  dc:   cf 91           pop     r28
  de:   08 95           ret

So after all, why does it need all those actions?

UPD

Full code:

#include <avr/io.h>

void delay (uint32_t cycles)
{
    for (volatile uint32_t i = 0; i < cycles; i++) {}
}

int main(void)
{
    DDRD |= 1 << DDD2 | 1 << DDD3 | 1 << DDD4 | 1 << DDD5;
    PORTD |= 1 << PORTD2 | 1 << PORTD4;
    while (1) 
    {
        const uint32_t d = 1000000;
        delay(d);
        PORTD ^= 1 << PORTD2 | 1 << PORTD3;
        delay(d);
        PORTD ^= 1 << PORTD4 | 1 << PORTD5;
        delay(d);
        PORTD ^= 1 << PORTD3 | 1 << PORTD2;
        delay(d);
        PORTD ^= 1 << PORTD5 | 1 << PORTD4;
    }
}

Compiler: gcc version 5.4.0 (AVR_8_bit_GNU_Toolchain_3.6.0_1734)

Build command:

avr-gcc.exe  -x c -funsigned-char -funsigned-bitfields -DDEBUG  -I%inc_folder%  -O1 -ffunction-sections -fdata-sections -fpack-struct -fshort-enums -g2 -Wall -mmcu=atmega328p -B %atmega328p_folder% -c -std=gnu99 -MD -MP %sources, object files, etc%

A reply to the cautions about the delay function:

Yes, I fully understand the possible problems with such an approach to the delay function, namely the not-so-predictable timing and the risk of the loop being optimized out. This is only a self-educational example to see what an empty loop is compiled into.

Solution

First of all, please note that writing delays using a busy loop like this is not a good idea, since the timing will depend on the details of how your compiler operates. For the AVR platform, use the built-in delay functions provided by avr-libc and GCC, as described in JLH's answer.

Double rcall and four pops

An rcall .+0 instruction pushes the address of the next instruction onto the stack and then jumps to that same instruction, so normally one placed at the top of a function would be a handy way to double the number of times the function runs. But in this case the return addresses are never returned to; they are removed from the stack at the end of the function by the four pop instructions.

So at the beginning of the function, the compiler is adding four bytes to the stack, and at the end of the function it is removing four bytes from the stack. On the ATmega328P each rcall pushes a two-byte return address, so the two rcalls reserve exactly the four bytes a uint32_t needs. This is how the compiler allocates storage for your variable, i. Since i is a local variable, it generally gets stored on the stack. Compiler optimizations might allow the variable to be stored in registers, but I don't think such optimizations are allowed for volatile variables. This answers your first and fourth questions.
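To make the bookkeeping concrete, here is a minimal sketch in plain C that replays only the stack-pointer effect of the prologue and epilogue from the disassembly. It is a toy model meant to run on a PC, not AVR code: the avr_stack type, the helper names, and the entry value of SP in the note below are all made up for illustration, and it assumes a device such as the ATmega328P, where rcall pushes a two-byte return address.

```c
#include <stdint.h>

/* Toy model: tracks only the value of the stack pointer (hypothetical type). */
typedef struct {
    uint16_t sp;
} avr_stack;

/* push stores one byte at SP, then decrements SP. */
static void push_byte(avr_stack *s) { s->sp -= 1; }

/* pop increments SP, then reads one byte. */
static void pop_byte(avr_stack *s) { s->sp += 1; }

/* rcall pushes a return address; assumed to be 2 bytes here. */
static void rcall(avr_stack *s) { s->sp -= 2; }

/* Replays the SP effect of delay()'s prologue and returns the value of Y. */
uint16_t delay_prologue(avr_stack *s) {
    push_byte(s);   /* push r28                      */
    push_byte(s);   /* push r29                      */
    rcall(s);       /* rcall .+0 : reserves 2 bytes  */
    rcall(s);       /* rcall .+0 : reserves 2 more   */
    return s->sp;   /* in r28, 0x3d / in r29, 0x3e : Y = SP */
}

/* Replays the SP effect of delay()'s epilogue, up to but not including ret. */
void delay_epilogue(avr_stack *s) {
    for (int k = 0; k < 4; k++) {
        pop_byte(s);    /* 4x pop r0 : discards the 4 reserved bytes */
    }
    pop_byte(s);        /* pop r29 */
    pop_byte(s);        /* pop r28 */
}
```

Starting from an arbitrary entry value such as SP = 0x08F0, the prologue leaves Y = 0x08EA. Because SP always points at the next free byte, the four reserved bytes are Y+1..Y+4 (0x08EB..0x08EE), which is why the listing addresses i with those offsets. The epilogue then brings SP back to 0x08F0 before ret.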

Extra loads and stores

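You marked your variable i as volatile, which tells the compiler it cannot make any assumptions about the memory that i is stored in. Every time your code reads or writes to i, the compiler must generate a real read or write to the RAM locations that hold i; it's not allowed to make the optimizations you thought it would make. Neither std nor cp/cpc modifies its register operands. The reloads are there only because i++ reads and then writes i, and i < cycles reads it again, and each of those accesses has to touch memory. This answers your second and third questions.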
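To put a rough number on what those extra loads and stores cost, the sketch below adds up the cycles of one pass through the loop body (addresses 0xaa to 0xd0 in the listing), using the per-instruction timings for classic AVR cores from the AVR instruction set manual. It is ordinary C meant to run on a PC. The function names are hypothetical, and the 16 MHz clock in the note below is an assumption, since the question does not state the clock frequency. The estimate ignores the prologue, the epilogue, the final not-taken branch, and any interrupts.

```c
#include <stdint.h>

/* Cycle counts on a classic AVR core such as the ATmega328P. */
enum {
    CYCLES_LDD      = 2,  /* ldd Rd, Y+q          */
    CYCLES_STD      = 2,  /* std Y+q, Rr          */
    CYCLES_ADIW     = 2,  /* adiw Rd, K           */
    CYCLES_ADC      = 1,  /* adc Rd, Rr           */
    CYCLES_CP       = 1,  /* cp and cpc           */
    CYCLES_BR_TAKEN = 2   /* brcs when it jumps   */
};

/* Cycles for one pass through the loop body at 0xaa..0xd0. */
uint32_t cycles_per_iteration(void) {
    return 4 * CYCLES_LDD               /* read i for i++         */
         + CYCLES_ADIW + 2 * CYCLES_ADC /* 32-bit increment       */
         + 4 * CYCLES_STD               /* write i back           */
         + 4 * CYCLES_LDD               /* read i for i < cycles  */
         + 4 * CYCLES_CP                /* cp + 3x cpc            */
         + CYCLES_BR_TAKEN;             /* brcs back to the top   */
}

/* Approximate loop duration in milliseconds; f_cpu_hz must be non-zero. */
uint64_t loop_time_ms(uint32_t iterations, uint32_t f_cpu_hz) {
    uint64_t total_cycles = (uint64_t)iterations * cycles_per_iteration();
    return total_cycles * 1000u / f_cpu_hz;
}
```

One iteration comes to 34 cycles, of which 24 are the volatile loads and stores. With the 1000000 iterations from the question and an assumed 16 MHz clock, loop_time_ms(1000000, 16000000) gives 2125, so each delay(d) call would take roughly 2.1 seconds.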

The volatile keyword is useful for special function registers on your chip, and for variables that are shared between the main loop and an interrupt.