I am new to AVR assembly language and decided to look inside the code of a dumb delay function written in C to see how long an empty loop with long (32-bit) arithmetic could take.
The delay function is as follows:
void delay(uint32_t cycles) {
for (volatile uint32_t i = 0; i < cycles; i++) {}
}
I disassembled it with objdump and, I think, got some strange results (see the four questions in the comments):
00000080 <delay>:
void delay (uint32_t cycles) {
; `cycles` is stored in r22..r25
80: cf 93 push r28
82: df 93 push r29
; First one: why does the compiler rcall the next position relative to the following
; two instructions? Some stack management?
84: 00 d0 rcall .+0 ; 0x86 <delay+0x6>
86: 00 d0 rcall .+0 ; 0x88 <delay+0x8>
88: cd b7 in r28, 0x3d ; 61
8a: de b7 in r29, 0x3e ; 62
8c: ab 01 movw r20, r22
8e: bc 01 movw r22, r24
; Now `cycles` is in r20..r23
for (volatile uint32_t i = 0; i < cycles; i++) {}
; r1 was earlier initialized with zero by `eor r1, r1`
; `i` is in r24..r27
90: 19 82 std Y+1, r1 ; 0x01
92: 1a 82 std Y+2, r1 ; 0x02
94: 1b 82 std Y+3, r1 ; 0x03
96: 1c 82 std Y+4, r1 ; 0x04
98: 89 81 ldd r24, Y+1 ; 0x01
9a: 9a 81 ldd r25, Y+2 ; 0x02
9c: ab 81 ldd r26, Y+3 ; 0x03
9e: bc 81 ldd r27, Y+4 ; 0x04
a0: 84 17 cp r24, r20
a2: 95 07 cpc r25, r21
a4: a6 07 cpc r26, r22
a6: b7 07 cpc r27, r23
a8: a0 f4 brcc .+40 ; 0xd2 <delay+0x52>, to location A
; location B:
; Third (yes, before the second) one: why does it load the registers each time after
; comparing the counter with the limit if `cp`, `cpc` do not change the registers?
aa: 89 81 ldd r24, Y+1 ; 0x01
ac: 9a 81 ldd r25, Y+2 ; 0x02
ae: ab 81 ldd r26, Y+3 ; 0x03
b0: bc 81 ldd r27, Y+4 ; 0x04
b2: 01 96 adiw r24, 0x01 ; 1
b4: a1 1d adc r26, r1
b6: b1 1d adc r27, r1
; Second one: why does it store and load the same registers with unchanged values?
; If it needs to store the registers, why does it load anyway? Does `std` change the
; source registers?
b8: 89 83 std Y+1, r24 ; 0x01
ba: 9a 83 std Y+2, r25 ; 0x02
bc: ab 83 std Y+3, r26 ; 0x03
be: bc 83 std Y+4, r27 ; 0x04
c0: 89 81 ldd r24, Y+1 ; 0x01
c2: 9a 81 ldd r25, Y+2 ; 0x02
c4: ab 81 ldd r26, Y+3 ; 0x03
c6: bc 81 ldd r27, Y+4 ; 0x04
c8: 84 17 cp r24, r20
ca: 95 07 cpc r25, r21
cc: a6 07 cpc r26, r22
ce: b7 07 cpc r27, r23
d0: 60 f3 brcs .-40 ; 0xaa <delay+0x2a>, to location B
}
; Location A:
; Finally, fourth one: so, under my first question it issued an `rcall` twice and now
; just pops the return addresses to nowhere? Now the `rcall`s are double-strange
d2: 0f 90 pop r0
d4: 0f 90 pop r0
d6: 0f 90 pop r0
d8: 0f 90 pop r0
da: df 91 pop r29
dc: cf 91 pop r28
de: 08 95 ret
So after all, why does it need all those actions?
Full code:
#include <avr/io.h>
void delay (uint32_t cycles)
{
for (volatile uint32_t i = 0; i < cycles; i++) {}
}
int main(void)
{
DDRD |= 1 << DDD2 | 1 << DDD3 | 1 << DDD4 | 1 << DDD5;
PORTD |= 1 << PORTD2 | 1 << PORTD4;
while (1)
{
const uint32_t d = 1000000;
delay(d);
PORTD ^= 1 << PORTD2 | 1 << PORTD3;
delay(d);
PORTD ^= 1 << PORTD4 | 1 << PORTD5;
delay(d);
PORTD ^= 1 << PORTD3 | 1 << PORTD2;
delay(d);
PORTD ^= 1 << PORTD5 | 1 << PORTD4;
}
}
Compiler: gcc version 5.4.0 (AVR_8_bit_GNU_Toolchain_3.6.0_1734)
Build command:
avr-gcc.exe -x c -funsigned-char -funsigned-bitfields -DDEBUG -I%inc_folder% -O1 -ffunction-sections -fdata-sections -fpack-struct -fshort-enums -g2 -Wall -mmcu=atmega328p -B %atmega328p_folder% -c -std=gnu99 -MD -MP %sources, object files, etc%
A reply to the cautions about the delay function:
Yes, I fully understand the possible problems with such an approach to the delay function, namely the not-so-predictable timing and the risk of the loop being optimized out. This is only a self-educational example to see what an empty loop is compiled into.
First of all, please note that writing delays using a busy loop like this is not good, since the timing will depend on the details of how your compiler operates. For the AVR platform, use the built-in delay functions provided by avr-libc and GCC, as described in JLH's answer.
Normally, an rcall .+0 instruction at the top of a function would be a handy way to double the number of times the function runs, because the eventual ret would jump back to the instruction right after the rcall. But in this case, the return addresses are never returned to; they are simply removed from the stack at the end of the function by the four pop instructions.
So at the beginning of the function the compiler adds four bytes to the stack (each rcall pushes a two-byte return address), and at the end of the function it removes four bytes from the stack. This is how the compiler allocates storage for your variable i. Since i is a local variable, it generally gets stored on the stack. Compiler optimizations might allow the variable to be stored in registers, but I don't think such optimizations are allowed for volatile variables. This answers your first and fourth questions.
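The bookkeeping can be checked with a small host-side model of the stack pointer. This is only a sketch under stated assumptions: frame_model_t, model_prologue and model_epilogue are hypothetical names, and the model tracks addresses only, not stack contents. It relies on push and rcall storing first and then decrementing SP, so that SP always points one byte below the last byte stored, which is why the locals show up at Y+1..Y+4 rather than Y+0..Y+3.

```c
#include <stdint.h>

/* Hypothetical host-side model; it tracks only SP and Y, not stack contents. */
typedef struct {
    uint16_t sp;  /* stack pointer (I/O registers 0x3d/0x3e) */
    uint16_t y;   /* frame pointer r29:r28 */
} frame_model_t;

/* Mirrors addresses 0x80..0x8a of the listing. */
void model_prologue(frame_model_t *m)
{
    m->sp -= 1;    /* push r28 */
    m->sp -= 1;    /* push r29 */
    m->sp -= 2;    /* rcall .+0: pushes a 2-byte return address, falls through */
    m->sp -= 2;    /* rcall .+0: another 2 bytes, 4 in total for i */
    m->y = m->sp;  /* in r28, 0x3d / in r29, 0x3e */
}

/* Mirrors addresses 0xd2..0xdc of the listing. */
void model_epilogue(frame_model_t *m)
{
    m->sp += 4;    /* 4x pop r0: discard the 4 reserved bytes */
    m->sp += 2;    /* pop r29, pop r28 */
}
```

Starting from an arbitrarily chosen SP of 0x08F0, the prologue leaves both SP and Y at 0x08EA, so Y+1..Y+4 (0x08EB..0x08EE) are exactly the four bytes reserved by the two rcalls, and the epilogue brings SP back to 0x08F0.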
You marked your variable i as volatile, which tells the compiler it cannot make any assumptions about the memory that i is stored in. Every time your code reads or writes i, the compiler must generate a real read or write to the RAM locations that hold i; it's not allowed to make the optimizations you thought it would make. (std does not modify its source registers; the reload is there only because i is volatile.) This answers your second and third questions.
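To put a rough number on what those forced loads and stores cost, the loop body from 0xaa to 0xd0 can be tallied by hand. The sketch below is an estimate, not a measurement: the function names are hypothetical, the per-instruction cycle counts are the ones listed in the ATmega328P instruction set summary, and the clock frequency is a parameter because the question does not state one.

```c
#include <stdint.h>

/* Cycle counts per instruction, as listed in the ATmega328P instruction set summary. */
enum {
    CYC_LDD = 2, CYC_STD = 2, CYC_ADIW = 2,
    CYC_ADC = 1, CYC_CP = 1, CYC_CPC = 1,
    CYC_BRANCH_TAKEN = 2
};

/* One pass through the loop body, 0xaa..0xd0 in the listing. */
uint32_t cycles_per_iteration(void)
{
    return 4 * CYC_LDD             /* reload i from Y+1..Y+4 for i++ */
         + CYC_ADIW + 2 * CYC_ADC  /* 32-bit increment */
         + 4 * CYC_STD             /* write i back */
         + 4 * CYC_LDD             /* reload i for the comparison */
         + CYC_CP + 3 * CYC_CPC    /* 32-bit compare against cycles */
         + CYC_BRANCH_TAKEN;       /* brcs back to 0xaa */
}

/* Approximate loop time in milliseconds; ignores call, prologue and epilogue overhead. */
uint64_t loop_millis(uint32_t iterations, uint32_t f_cpu_hz)
{
    uint64_t cycles = (uint64_t)iterations * cycles_per_iteration();
    return cycles * 1000u / f_cpu_hz;
}
```

That comes to 34 cycles per iteration, 24 of which are the ldd and std instructions that volatile forces. With the 1000000 iterations used in main and an assumed 16 MHz clock, loop_millis(1000000, 16000000) gives 2125, i.e. a little over 2 seconds per delay(d) call.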
The volatile keyword is useful for the special function registers on your chip, and it is useful for variables that are shared between the main loop and an interrupt.
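The shared-variable case can be sketched as follows. This is a host-side illustration, not real AVR code: tick_pending, fake_timer_isr and consume_tick are hypothetical names, and the "interrupt" is an ordinary function so that the example compiles anywhere. On the chip, the handler body would live in a function defined with avr-libc's ISR() macro instead.

```c
#include <stdbool.h>

/* Shared between the "interrupt" and the main loop, hence volatile. */
static volatile bool tick_pending = false;

/* Stand-in for an interrupt handler (assumption: on AVR this would be an ISR). */
void fake_timer_isr(void)
{
    tick_pending = true;
}

/* Called from the main loop: reports and clears a pending tick. */
bool consume_tick(void)
{
    if (tick_pending) {        /* volatile: really read from memory on every call */
        tick_pending = false;
        return true;
    }
    return false;
}
```

Without volatile, the compiler would be free to read tick_pending once and keep the value in a register, so a main loop polling consume_tick() might never see the flag change.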