c arduino avr arduino-uno atmel

Strange delay behavior in Atmega328P


So, I have implemented a custom delay function on top of the standard _delay_us() from util/delay.h.

inline void delay_us(uint16_t time) {
    while (time > 0) {
        _delay_us(1);
        time--;
    }
}

It's called inside a loop in the main function:

#define F_CPU 16000000UL

...

int main() {
    pin_mode(P2, OUTPUT);
    while (1) {
        pin_enable(P2);
        delay_us(1);
        pin_disable(P2);
        delay_us(1);
    }
}

Using an oscilloscope, I can tell that the pin stays high for 1.120 us and low for 1.120 us with 1 as the parameter. Increasing the parameter to 6, the oscilloscope shows 6.120 us. But with 7, it stays at 9 us. With 10, about 14 us.

I know the loop comes with an overhead, but why is there no overhead (or why does the overhead not change) between 1 and 6 us?

Note: I'm using an Arduino UNO (16 MHz).


Solution

  • For small arguments gcc-avr will unroll the while loop, effectively stringing together multiple 1µs-delays:

    delay_us(5):
        ldi r24,lo8(5)
        mov r25,r24
        1: dec r25
        brne 1b
        mov r25,r24
        1: dec r25
        brne 1b
        mov r25,r24
        1: dec r25
        brne 1b
        mov r25,r24
        1: dec r25
        brne 1b
        1: dec r24
        brne 1b
    

    At some point however, the compiler changes its strategy from space-consuming unrolling to actually branching through the while loop:

    delay_us(6):
        ldi r24,lo8(6)
        ldi r25,hi8(6)
        ldi r19,lo8(5)
    .L2:
        mov r18,r19
        1: dec r18
        brne 1b
        sbiw r24,1
        brne .L2
    

    At that point, the carefully crafted _delay_us() function is more or less defeated. The loop overhead (the mov, sbiw and brne around each 1 µs delay) is significant compared to the 16 clock cycles needed for a single _delay_us(1), and it is paid on every loop iteration.

    The sudden increase in runtime you describe is basically the point at which your compiler stops unrolling the loop.

    Compare this to calling _delay_us(6) directly:

    _delay_us(6):
        ldi r24,lo8(32)
        1: dec r24
        brne 1b
    

    The assembly shown above may differ somewhat from what your compiler produces, since compiler output can vary significantly with version and flags, but the listings should be reasonably close. For the examples I assumed gcc-avr 4.6.4 with optimization level -O2.