Tags: c, assembly, avr, inline-assembly, avr-gcc

Moving data into __uint24 with assembly


I originally had the following C code:

volatile register uint16_t counter asm("r12");

__uint24 getCounter() {
  __uint24 res = counter;
  res = (res << 8) | TCNT0;
  return res;
}

This function runs in some hot places and is inlined, and I'm trying to cram a lot of stuff into an ATtiny13, so it came time to optimize it.

That function compiles to:

getCounter:
        movw r24,r12
        ldi r26,0
        clr r22
        mov r23,r24
        mov r24,r25
        in r25,0x32
        or r22,r25
        ret

I came up with this assembly:

inline __uint24 getCounter() {
  //__uint24 res = counter;
  //res = (res << 8) | TCNT0;
  
  uint32_t result;
  asm(
    "in %A[result],0x32" "\n\t"
    "movw %C[result],%[counter]" "\n\t"
    "mov %B[result],%C[result]" "\n\t"
    "mov %C[result],%D[result]" "\n\t"
    : [result] "=r" (result)
    : [counter] "r" (counter)
    :
  );
  return (__uint24) result;
}

The reason for uint32_t is to "allocate" the fourth consecutive register and to let the compiler know it is clobbered (since I cannot write something like "%D[result]" in the clobber list).

Is my assembly correct? From my testing it seems like it is. Is there a way to let the compiler optimize getCounter() better so there's no need for confusing assembly? Is there a better way to do this in assembly?

EDIT: The whole idea with the movw is to keep the read atomic, since the counter variable is incremented inside of an interrupt.


Solution

  • From my experiments in Godbolt, it seems the avr-gcc optimizer is just not sophisticated enough, even with the -O3 flag. I doubt there are any other flags that can trick it into optimizing this specific code better (I tried some, but none helped).

    But there is an alternative way to write the same code using a union, and in that case the compiler optimizes the assembly better. Thus, there is no need to resort to inline assembly.

    Original code analysis

    1. The counter variable is stored in r12 (LSB) and r13 (MSB) registers.
    2. TCNT0 is read from I/O space address 0x32 (by in Rd, 0x32 instruction).
    3. According to the avr-gcc ABI, the 24-bit value is returned in r22(LSB):r23:r24(MSB).
    4. So to summarize, we want the following transfer to occur:
      r24 <-- r13
      r23 <-- r12
      r22 <-- TCNT0   
      

    Even newer solution (no inline assembly!)

    Looking into the code, I guess you have some kind of timer interrupt incrementing counter when the timer reaches some upper threshold. If that's the case, the code has a deeper problem, even in the pure C version. The important part is that the reads of TCNT0 and counter should be atomic together, as a single unit! Otherwise, if the interrupt occurs between the movw and in instructions, your result will be inaccurate. An example scenario demonstrating the bug:

    counter = 0x0010, TCNT0 = 0xff
    MOVW copies 0x0010
    Interrupt occurs => handler sets counter = 0x0011 and TCNT0 = 0 
    IN instruction reads TCNT0 = 0
    result = 0x0010_00 (instead of expected 0x0010_ff)
    

    There are two ways to solve this:

    1. Wrap the two reads in CLI / SEI so that no interrupt can occur in the middle.
    2. Read TCNT0 twice, before and after reading the counter. If the second read gives a smaller result, an interrupt occurred in between and we can't trust the values; repeat the whole read.

    Thus, a correct solution without the bug might look like this (add an inline specifier on the function as needed):

    __uint24 getCounter() {
      union
      {
        __uint24 result;
    
        struct {
          uint8_t lo;
          uint16_t hi;
        } parts;
      } u;
    
      __builtin_avr_cli();
      u.parts.hi = counter;
      u.parts.lo = TCNT0;
      __builtin_avr_sei();
    
      return u.result;
    }
    

    Producing:

    getCounter:
            cli
            mov r23,r12
            mov r24,r13
            in r22,0x32
            sei
            ret
    

    Godbolt: https://godbolt.org/z/YrWrT8sT4

    Newer solution (less assembly, partial atomicity)

    With the atomicity requirement added, we must use the movw instruction. Here is a version that minimizes the amount of inline assembly and uses as much C as possible:

    __uint24 getCounter() {
      union
      {
        __uint24 result;
    
        struct {
          uint8_t lo;
          uint16_t hi;
        } parts;
      } u;
    
      uint16_t tmp;
    
      // Ensure the counter is read atomically with movw instruction
      asm("movw %A[tmp],%[counter]  \n\t" : [tmp] "=r" (tmp) : [counter] "r" (counter));
    
      u.parts.hi = tmp;
      u.parts.lo = TCNT0;
    
      return u.result;
    }
    

    Godbolt: https://godbolt.org/z/P9a9K6n76

    Old solution (without atomicity)

    Question author's assembly analysis

    It looks correct and produces the right results. However, there are two things I can suggest improving:

    1. It has 3 mov instructions, taking 3 cycles to execute. gcc generated similar code because movw operates only on evenly aligned registers. But you can replace these with just 2 mov instructions and it will also remove the need for the larger uint32_t variable.
    2. I would avoid hardcoding TCNT0 address for better code portability.

    Suggested assembly

    So here is a slightly modified version of your code:

    inline __uint24 getCounter() {
      __uint24 result;
      asm(
        "in   %A[result], %[tcnt0]"    "\n\t"
        "mov  %B[result], %A[counter]" "\n\t"
        "mov  %C[result], %B[counter]" "\n\t"
        : [result] "=r" (result)
        : [counter] "r" (counter)
        , [tcnt0] "I" (_SFR_IO_ADDR(TCNT0))
      );
      return result;
    }
    

    However, note a downside of this solution – we lose atomicity on reading the counter. If an interrupt occurs between the two mov instructions and counter is modified inside the interrupt, we might get incorrect results. But if counter is never modified by interrupts, I would prefer the two separate mov instructions for the performance benefit.

    Godbolt: https://godbolt.org/z/h3nT4or97 (I removed inline keywords to show the generated assembly)