Tags: c++, embedded, volatile, isr

Should volatile still be used for sharing data with ISRs in modern C++?


I've seen some flavors of this question around, with mixed answers, and I'm still unsure whether they are up to date and fully apply to my use case, so I'll ask here. Do let me know if it's a duplicate!

Given that I'm developing for STM32 microcontrollers (bare-metal) using C++17 and the gcc-arm-none-eabi-9 toolchain:

Do I still need to use volatile for sharing data between an ISR and main()?

volatile std::int32_t flag = 0;

extern "C" void ISR()
{
    flag = 1;
}

int main()
{
    while (!flag) { ... }
}

It's clear to me that I should always use volatile for accessing memory-mapped HW registers.

However, for the ISR use case I don't know whether it counts as "multithreading". For that case, people recommend using C++11's threading features (e.g. std::atomic). I'm aware of the difference between volatile (don't optimize accesses away) and atomic (thread-safe access), so the answers suggesting std::atomic confuse me here.

For the case of "real" multithreading on x86 systems I haven't seen the need to use volatile.

In other words: can the compiler know that flag can change inside ISR? If not, how can it know it in regular multithreaded applications?

Thanks!


Solution

  • I think that in this case both volatile and atomic will most likely work in practice on 32-bit ARM. At least in an older version of the STM32 tools I saw that the C atomics were in fact implemented using volatile for small types.

    Volatile will work because the compiler may not optimize away any access to the variable that appears in the code.

    However, the generated code must differ for types that cannot be loaded in a single instruction. If you use a volatile int64_t, the compiler will happily load it in two separate instructions. If the ISR runs between the two loads, you will read half of the old value and half of the new value.
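    The tearing hazard can be simulated on a host (all names below are illustrative, not from the post): a 64-bit value stored as two 32-bit halves, with the "ISR" firing between the two half-loads:

```cpp
#include <cassert>
#include <cstdint>

// A 64-bit "variable" as a 32-bit core sees it: two separately loaded halves.
struct SplitU64 {
    volatile std::uint32_t lo;
    volatile std::uint32_t hi;
};

SplitU64 shared = {0, 0};

// What the ISR conceptually does: store a new 64-bit value, half by half.
void isr_store(std::uint64_t v)
{
    shared.lo = static_cast<std::uint32_t>(v);
    shared.hi = static_cast<std::uint32_t>(v >> 32);
}

// main()'s load takes two instructions; here the "ISR" runs between them,
// so the result mixes the old low half with the new high half.
std::uint64_t torn_load_with_isr_between(std::uint64_t new_value)
{
    std::uint32_t lo = shared.lo;  // first load: old low half
    isr_store(new_value);          // ISR fires between the two loads
    std::uint32_t hi = shared.hi;  // second load: new high half
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}
```

    The returned value is neither the old one nor the new one, which is exactly why a multi-word volatile is not enough.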

    Unfortunately, using atomic<int64_t> may also fail with interrupt service routines if the implementation is not lock-free. For Cortex-M, 64-bit accesses are not necessarily lock-free, so atomic should not be relied on without checking the implementation. Depending on the implementation, the system might deadlock if the locking mechanism is not reentrant and the interrupt fires while the lock is held. Since C++17, this can be queried by checking atomic<T>::is_always_lock_free. A specific answer for a specific atomic variable (this may depend on alignment) can be obtained by checking flagA.is_lock_free() (since C++11).
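    A short sketch of both queries (the names here are illustrative):

```cpp
#include <atomic>
#include <cstdint>

// Compile-time answer for the type as a whole (C++17). A single-word flag
// is lock-free on any target worth using:
static_assert(std::atomic<std::int32_t>::is_always_lock_free,
              "single-word flag must be lock-free for ISR use");

// The 64-bit answer is target-dependent: true on x86-64 and Cortex-A,
// typically false on Cortex-M, where atomic<int64_t> falls back to a
// lock -- and a locking atomic must not be touched from an ISR.
constexpr bool wide_is_lock_free =
    std::atomic<std::int64_t>::is_always_lock_free;

// Run-time answer for one specific object (C++11); on some targets this
// can depend on the object's alignment:
bool this_flag_is_lock_free()
{
    static std::atomic<std::int64_t> flag64{0};
    return flag64.is_lock_free();
}
```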

    So longer data must be protected by a separate mechanism (for example by turning off interrupts around the access and making the variable atomic or volatile).
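    One way to sketch that mechanism is a RAII guard that masks interrupts for the duration of the access. On a real STM32 target the two helper functions below would be the CMSIS intrinsics (__get_PRIMASK(), __disable_irq(), __set_PRIMASK()); here they are host-side stubs so the sketch compiles anywhere, and all names are illustrative:

```cpp
#include <cstdint>

// Host-side stand-ins for the PRIMASK intrinsics (assumption: on target
// these would save, set, and restore the interrupt mask register).
static std::uint32_t primask_state = 0;
static std::uint32_t save_and_disable_irq()
{
    std::uint32_t prev = primask_state;
    primask_state = 1;  // interrupts masked
    return prev;
}
static void restore_irq(std::uint32_t prev) { primask_state = prev; }

// RAII guard: interrupts stay masked for the object's lifetime, and the
// previous mask is restored on scope exit, so nesting is safe.
class IrqLock {
    std::uint32_t saved_;
public:
    IrqLock() : saved_(save_and_disable_irq()) {}
    ~IrqLock() { restore_irq(saved_); }
    IrqLock(const IrqLock&) = delete;
    IrqLock& operator=(const IrqLock&) = delete;
};

volatile std::int64_t bigFlag = 0;  // too wide for one load on Cortex-M

std::int64_t read_big_flag()
{
    IrqLock lock;   // no ISR can run between the two half-loads now
    return bigFlag;
}

void write_big_flag(std::int64_t v)
{
    IrqLock lock;
    bigFlag = v;
}
```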

    So the correct way is to use std::atomic, as long as the access is lock-free. If you are concerned about performance, it may pay off to select the appropriate memory order and stick to values that can be loaded in a single instruction.
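    A sketch of that performance-oriented variant, with a std::thread standing in for the ISR so it runs on a host (names are illustrative): the producer publishes with release, the consumer spins with acquire, which also orders any ordinary data written before the flag (plain relaxed would suffice for a bare flag with no payload).

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

std::atomic<std::int32_t> ready{0};
std::int32_t payload = 0;  // ordinary data published via the flag

// ISR stand-in: write the data first, then publish the flag.
void isr_like_producer()
{
    payload = 42;
    ready.store(1, std::memory_order_release);
}

// main()'s side: spin until the flag is set; acquire guarantees the
// payload written before the release store is visible afterwards.
std::int32_t wait_and_read()
{
    while (!ready.load(std::memory_order_acquire)) {}
    return payload;
}
```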

    Not using either would be wrong: the compiler will check the flag only once.

    These functions all wait for a flag, but they get translated differently:

    #include <atomic>
    #include <cstdint>
    
    using FlagT = std::int32_t;
    
    volatile FlagT flag = 0;
    void waitV()
    {
        while (!flag) {}
    }
    
    std::atomic<FlagT> flagA{0};
    void waitA()
    {
        while(!flagA) {}    
    }
    
    void waitRelaxed()
    {
        while(!flagA.load(std::memory_order_relaxed)) {}    
    }
    
    FlagT wrongFlag;
    void waitWrong()
    {
        while(!wrongFlag) {}
    }
    

    Using volatile you get a loop that reexamines the flag as you wanted:

    waitV():
            ldr     r2, .L5
    .L2:
            ldr     r3, [r2]
            cmp     r3, #0
            beq     .L2
            bx      lr
    .L5:
            .word   .LANCHOR0
    

    Atomic with the default sequentially consistent access produces synchronized access:

    waitA():
            push    {r4, lr}
    .L8:
            bl      __sync_synchronize
            ldr     r3, .L11
            ldr     r4, [r3, #4]
            bl      __sync_synchronize
            cmp     r4, #0
            beq     .L8
            pop     {r4}
            pop     {r0}
            bx      r0
    .L11:
            .word   .LANCHOR0
    

    If you do not care about the memory order you get a working loop just as with volatile:

    waitRelaxed():
            ldr     r2, .L17
    .L14:
            ldr     r3, [r2, #4]
            cmp     r3, #0
            beq     .L14
            bx      lr
    .L17:
            .word   .LANCHOR0
    

    Using neither volatile nor atomic will bite you with optimization enabled, as the flag is only checked once:

    waitWrong():
            ldr     r3, .L24
            ldr     r3, [r3, #8]
            cmp     r3, #0
            bne     .L23
    .L22:                        // infinite loop!
            b       .L22
    .L23:
            bx      lr
    .L24:
            .word   .LANCHOR0
    flag:
    flagA:
    wrongFlag: