I originally had the following C code:
volatile register uint16_t counter asm("r12");
__uint24 getCounter() {
__uint24 res = counter;
res = (res << 8) | TCNT0;
return res;
}
This function runs in some hot places and is inlined, and I'm trying to cram a lot of stuff into an ATtiny13, so it came time to optimize it.
That function compiles to:
getCounter:
movw r24,r12
ldi r26,0
clr r22
mov r23,r24
mov r24,r25
in r25,0x32
or r22,r25
ret
I came up with this assembly:
inline __uint24 getCounter() {
//__uint24 res = counter;
//res = (res << 8) | TCNT0;
uint32_t result;
asm(
"in %A[result],0x32" "\n\t"
"movw %C[result],%[counter]" "\n\t"
"mov %B[result],%C[result]" "\n\t"
"mov %C[result],%D[result]" "\n\t"
: [result] "=r" (result)
: [counter] "r" (counter)
:
);
return (__uint24) result;
}
The reason for uint32_t
is to "allocate" the fourth consecutive register and for the compiler to understand it is clobbered (since I cannot do something like "%D[result]"
in the clobber list)
Is my assembly correct? From my testing it seems like it is.
Is there a way to allow the compiler to optimize getCounter()
better so there's not need for confusing assembly?
Is there a better way to do this in assembly?
EDIT: The whole idea with the movw
is to keep the read atomic, since the counter
variable is incremented inside of an interrupt.
As it seems from my experiments in GodBolt, even with the -O3
flag avr-gcc optimizer is just not sophisticated enough. I doubt there are any other flags that can trick it into optimizing this specific code more (I tried some but none helped).
But there is an alternative way to write the some code using union
and in that case compiler optimizes the assembly better. Thus, no need to resort to inline assembly.
counter
variable is stored in r12
(LSB) and r13
(MSB) registers.TCNT0
is read from I/O space address 0x32 (by in Rd, 0x32
instruction).r22(LSB):r23:r24(MSB)
.r24 <-- r13
r23 <-- r12
r22 <-- TCNT0
Looking into the code, I guess you have some kind of timer interrupt incrementing counter
when the timer reaches some upper threshold. If that's the case, the code has a deeper problem, even in the pure C version. The important part is that the read of both TCNT0
and counter
should be atomic together as single unit! Otherwise, if the interrupt occurs between the movw
and in
instructions, your result will be inaccurate. Example of scenario demonstrating the bug:
counter = 0x0010, TCNT0 = 0xff
MOVW copies 0x0010
Interrupt occurs => handler sets counter = 0x0011 and TCNT0 = 0
IN instruction reads TCNT0 = 0
result = 0x0010_00 (instead of expected 0x0010_ff)
There are two ways for to solve this:
CLI / SEI
around the two reads to get them together without possible interrupt in the middle.TCNT0
twice, before and after reading the counter. If the second read gives smaller result, it means an interrupt in between and we can't trust the values, repeat the whole read.Thus, a correct solution, without the bug might be like this (add inline specification on the function as needed):
__uint24 getCounter() {
union
{
__uint24 result;
struct {
uint8_t lo;
uint16_t hi;
} parts;
} u;
__builtin_avr_cli();
u.parts.hi = counter;
u.parts.lo = TCNT0;
__builtin_avr_sei();
return u.result;
}
Producing:
getCounter:
cli
mov r23,r12
mov r24,r13
in r22,0x32
sei
ret
Godbolt: https://godbolt.org/z/YrWrT8sT4
With the atomicity requirement added, we must use the movw
instruction. Here is a version that minimizes the amount of inline assembly and uses as much C as possible:
__uint24 getCounter() {
union
{
__uint24 result;
struct {
uint8_t lo;
uint16_t hi;
} parts;
} u;
uint16_t tmp;
// Ensure the counter is read atomically with movw instruction
asm("movw %C[tmp],%[counter] \n\t" : [tmp] "=r" (tmp) : [counter] "r" (counter));
u.parts.hi = tmp;
u.parts.lo = TCNT0;
return u.result;
}
Godbolt: https://godbolt.org/z/P9a9K6n76
It looks correct and provides the right results. However, there are two things I can suggest to improve:
mov
instructions, taking 3 cycles to execute. gcc generated similar code because movw
operates only on evenly aligned registers. But you can replace these with just 2 mov
instructions and it will also remove the need for the larger uint32_t
variable.TCNT0
address for better code portability.So here is a slightly modified version of your code:
inline __uint24 getCounter() {
__uint24 result;
asm(
"in %A[result], %[tcnt0]" "\n\t"
"mov %B[result], %A[counter]" "\n\t"
"mov %C[result], %B[counter]" "\n\t"
: [result] "=r" (result)
: [counter] "r" (counter)
, [tcnt0] "I" (_SFR_IO_ADDR(TCNT0))
);
return result;
}
However, note a downside of this solution – we loose atomicity on reading the counter. If an interrupt occurs between the two mov
instructions and counter
is modified inside the interrupt, we might get correct results. But if counter
is never modified by interrupts, I would prefer using the two separate mov
instructions for performance benefits.
Godbolt: https://godbolt.org/z/h3nT4or97
(I removed inline
keywords to show the generated assembly)