Disabled DCache will prevent atomic_flag from being set

We are using a Zynq-7000 based SoC, i.e. a Cortex-A9 CPU, and we ran into the following issue with atomic_flag, which is used inside a library we depend on (open-amp).

We are using the second CPU on the SoC to execute bare-metal code.
When the dcache is disabled, atomic flags can no longer be set. Here is a minimal example that triggers the issue for us:

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define XREG_CONTROL_DCACHE_BIT (0X00000001U<<2U)
#define XREG_CP15_SYS_CONTROL   "p15, 0, %0,  c1,  c0, 0"
#define mfcp(rn)    ({uint32_t rval = 0U; \
             __asm__ __volatile__(\
               "mrc " rn "\n"\
               : "=r" (rval)\
             );\
             rval;\
             })
#define mtcp(rn, v) __asm__ __volatile__(\
             "mcr " rn "\n"\
             : : "r" (v)\
            );

static void DCacheDisable(void) {
    uint32_t CtrlReg;
    /* read the System Control Register; the cache is not cleaned or invalidated here */
    CtrlReg = mfcp(XREG_CP15_SYS_CONTROL);

    /* clear the C bit to disable the Data cache */
    CtrlReg &= ~(XREG_CONTROL_DCACHE_BIT);
    mtcp(XREG_CP15_SYS_CONTROL, CtrlReg);
}

int main(void) {
    DCacheDisable();

    atomic_flag flag = ATOMIC_FLAG_INIT;
    printf("Before\n");
    atomic_flag_test_and_set(&flag);
    printf("After\n");
}

The CPU executes the following loop for atomic_flag_test_and_set:

dmb     ish
ldrexb  r1, [r3] ; bne jumps here
strexb  r0, r2, [r3]
cmp     r0, #0
bne     -20     ; addr=0x1f011614: main + 0x00000060
dmb     ish

but register r0 always stays 1, i.e. strexb reports failure every time and the loop never exits. When the call to DCacheDisable is omitted, the code works flawlessly.

I really can't find any information about a disabled dcache and atomic flags.

Does anybody have a clue?

Toolchain: We are using Vitis 2022.2, which comes with an arm-xilinx-eabi-gcc.exe (GCC) 11.2.0. Compiler options are -O2 -std=c11 -mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=hard

Solution

This is common on ARM platforms that have a cache. The cache line is used as temporary storage for the exclusive lock. The ARM term is Exclusives Reservation Granule (ERG), which is the size of the block of memory that is tagged for exclusive access. On systems with a cache, you will typically find that the granule is the size of a cache line.
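As a rough illustration of the granule size, ARMv7-A cores report it in the ERG field of the Cache Type Register (CTR, read with mrc p15, 0, <Rt>, c0, c0, 1). The sketch below only decodes a CTR value in plain C. The function names are made up for this example, and the sample value 0x83338003 is an assumption about what a Cortex-A9 reports, so check it against the TRM or read the register on your own hardware.

```c
#include <assert.h>
#include <stdint.h>

/* ARMv7-A Cache Type Register (CTR) field positions. */
#define CTR_ERG_SHIFT      20u  /* Exclusives Reservation Granule, bits [23:20] */
#define CTR_DMINLINE_SHIFT 16u  /* smallest data cache line, bits [19:16] */
#define CTR_FIELD_MASK     0xFu

/* Size in bytes of the reservation granule encoded in a CTR value.
 * The field holds log2 of the granule size in 4-byte words. A value of 0
 * means "not reported"; the architectural maximum of 512 words is then
 * assumed here. */
uint32_t erg_bytes_from_ctr(uint32_t ctr)
{
    uint32_t log2_words = (ctr >> CTR_ERG_SHIFT) & CTR_FIELD_MASK;
    if (log2_words == 0u) {
        return 512u * 4u;
    }
    return 4u << log2_words;
}

/* Size in bytes of the smallest data cache line encoded in a CTR value. */
uint32_t dcache_min_line_bytes_from_ctr(uint32_t ctr)
{
    uint32_t log2_words = (ctr >> CTR_DMINLINE_SHIFT) & CTR_FIELD_MASK;
    return 4u << log2_words;
}

/* Usage example with an ASSUMED Cortex-A9 CTR value (verify on hardware). */
void ctr_example(void)
{
    const uint32_t ctr = 0x83338003u;
    assert(erg_bytes_from_ctr(ctr) == 32u);
    assert(dcache_min_line_bytes_from_ctr(ctr) == 32u);
}
```

With that value both fields decode to 32 bytes, which matches the statement that the granule is one cache line.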

So internally, ldrex and strex are implemented as part of the cache resolution policy. You can compare this to Cortex-M systems, where the entire memory space is a single reservation granule.

The ldrex/strex pair is useless for synchronizing with external devices that are not part of the AXI structure. If you want to disable the cache in order to work with an FPGA interface, I don't believe this can work. You would need to implement the cache protocol in the FPGA.
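To make the failure mode concrete, here is a toy software model of an exclusive monitor. It is not how the hardware is built. It only mimics the observable behaviour: the store-exclusive succeeds (status 0) only if a preceding load-exclusive was able to tag the granule. All names (toy_monitor, toy_ldrexb, toy_strexb, toy_test_and_set) and the 32-byte granule are assumptions made up for this sketch, and the `available` flag stands in for "the cache is enabled for this memory".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_ERG_BYTES 32u  /* assumed granule size: one cache line */

/* Toy model of a local exclusive monitor (illustrative only). */
typedef struct {
    bool available;     /* hypothetical: false models "no monitor, dcache off" */
    bool armed;         /* set when a load-exclusive tagged a granule */
    uintptr_t granule;  /* index of the tagged granule */
} toy_monitor;

static uintptr_t granule_of(const void *p)
{
    return (uintptr_t)p / TOY_ERG_BYTES;
}

/* Load-exclusive: returns the byte and tags its granule if a monitor exists. */
uint8_t toy_ldrexb(toy_monitor *m, const uint8_t *addr)
{
    if (m->available) {
        m->armed = true;
        m->granule = granule_of(addr);
    }
    return *addr;
}

/* Store-exclusive: returns 0 on success and 1 on failure, like strexb. */
uint32_t toy_strexb(toy_monitor *m, uint8_t *addr, uint8_t value)
{
    bool ok = m->available && m->armed && m->granule == granule_of(addr);
    m->armed = false;  /* the reservation is consumed either way */
    if (!ok) {
        return 1u;
    }
    *addr = value;
    return 0u;
}

/* Test-and-set retry loop, bounded so the demo terminates.
 * Returns the number of attempts used, or 0 if it never succeeded. */
unsigned toy_test_and_set(toy_monitor *m, uint8_t *flag,
                          unsigned max_attempts, uint8_t *old)
{
    for (unsigned i = 1u; i <= max_attempts; ++i) {
        *old = toy_ldrexb(m, flag);
        if (toy_strexb(m, flag, 1u) == 0u) {
            return i;
        }
    }
    return 0u;
}
```

With `available` set, the loop succeeds on the first attempt. With it cleared, every store-exclusive returns 1 and the flag is never written, which is the same symptom as r0 staying 1 in the disassembly from the question.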

For Cortex-M systems, there is no cache structure and custom logic implements a 'global monitor'.

The cache mechanism actually seems useful, as the cache line could be used as a transactional memory, i.e. either the whole line commits or not. It seems possible to create lock-free algorithms for structures with multiple pointers, where the nodes do not lock the entire list but only a single entry. However, I have never seen it used like this, mainly, I think, because the ARM documentation recommends against it (do not rely on the ERG size).
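For contrast, the portable way to get lock-free updates without relying on the ERG size is a single-word compare-and-swap through C11 atomics. This is not the cache-line transaction idea described above, only the conventional alternative. A minimal sketch follows; the lf_stack, lf_push and lf_take_all names are invented for this example. On ARMv7 a compiler would normally lower the compare-exchange to an ldrex/strex loop, so it depends on a working exclusive monitor just like atomic_flag does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct node {
    int value;
    struct node *next;
} node;

typedef struct {
    _Atomic(node *) head;
} lf_stack;

/* Push one node. Only the head pointer is contended; no lock is held on the
 * list, and a failed compare-exchange simply retries with the new head. */
void lf_push(lf_stack *s, node *n)
{
    node *old = atomic_load_explicit(&s->head, memory_order_relaxed);
    do {
        n->next = old;  /* old is refreshed by a failed compare-exchange */
    } while (!atomic_compare_exchange_weak_explicit(
                 &s->head, &old, n,
                 memory_order_release, memory_order_relaxed));
}

/* Detach the whole list in one atomic exchange. Taking everything at once
 * sidesteps the ABA problem that a naive single-node pop would have. */
node *lf_take_all(lf_stack *s)
{
    return atomic_exchange_explicit(&s->head, (node *)NULL,
                                    memory_order_acquire);
}
```

After pushing a and then b, lf_take_all returns b with b.next pointing at a, and a second call returns NULL.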