Tags: arm64, raspberry-pi4

How expensive are memory barriers on ARM64? -- The cost of a DMB instruction


ARM documentation says that memory barrier instructions are "very expensive", but gives no real indication of how expensive. I'm referring specifically to the ARM DMB instructions.

How expensive are they? Hundreds of CPU cycles, or thousands of CPU cycles?

Horrors that I can imagine:

  • That a read memory barrier completely invalidates L1 cache on the current core.

  • That a write barrier has to wait for all dirty L1 cache lines to either write back to L2 (single processor/multi-core), or main memory (multiprocessor).

Couldn't that plausibly run to thousands of CPU cycles? For example, the time to write 8,192 dirty L1 cache lines to L2 cache at ~7 CPU cycles each.

For context, I'm concerned about interactions between convolution FFTs running on a background thread (which definitely dirty the entire L1 cache) and smaller lock-free queues sending data packets to and from a real-time audio thread. I'm wondering whether it's worth using atomic instructions to read and write data in the queue, instead of using DMB memory barriers.

I'm wondering whether the following is taking place: a large FFT runs on a background thread, completely dirtying L1 cache; the audio thread wakes, issues a DMB write-barrier instruction, and now has to wait until the entirely dirty L1 cache writes back to L2.

The processor in question is a Raspberry Pi 4, ARMv8.1-A. ARMv8.1-A has atomic instructions for "improving the performance of multithreaded applications", which provide a path for avoiding DMB instructions on the audio thread altogether.

Preemptively: yes, I'm using C++20 atomic operations to generate the read and write barriers. I haven't decided whether to use mutexes or not, but the same question applies, since mutexes contain memory barriers too. I've checked: GCC generates the correct ARMv8.1-A instructions for atomic operations, and compiles memory barriers to DMB instructions.
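For concreteness, the queue pattern described above can be sketched as a minimal single-producer/single-consumer ring buffer (my own illustration; the class name `SpscQueue` is hypothetical, not from the question). On ARM64, the acquire/release operations below are where the compiler emits LDAR/STLR or DMB-based sequences, i.e. exactly the barriers being asked about:

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// Minimal single-producer/single-consumer ring buffer sketch.
// One slot is left unused so that head == tail means "empty".
template <typename T, std::size_t N>
class SpscQueue {
    std::array<T, N> buf_{};
    std::atomic<std::size_t> head_{0};  // advanced by the consumer
    std::atomic<std::size_t> tail_{0};  // advanced by the producer

public:
    // Called only from the producer thread.
    bool push(const T& v) {
        std::size_t t = tail_.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire))
            return false;  // queue full
        buf_[t] = v;
        // Release store publishes buf_[t] to the consumer.
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Called only from the consumer thread.
    std::optional<T> pop() {
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return std::nullopt;  // queue empty
        T v = buf_[h];
        head_.store((h + 1) % N, std::memory_order_release);
        return v;
    }
};
```

The acquire/release pairing here is weaker than full `std::atomic_thread_fence(std::memory_order_seq_cst)` barriers, which is one reason per-operation atomics can be cheaper than explicit DMB fences.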


Solution

  • FYI, the Raspberry Pi 4 does not contain an ARMv8.1 processor. It's a Cortex-A72, which is firmly in the ARMv8.0 family and therefore doesn't have those atomic instructions. C++ atomics should be implemented here with exclusive load/store operations (LDXR/STXR), although this is going to depend on your C++ library.
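    To illustrate the difference (a sketch of my own, not part of the original answer): the same C++ `fetch_add` compiles to an LDXR/STXR retry loop when targeting ARMv8.0, but to a single LSE instruction (LDADDAL) when GCC is given `-march=armv8.1-a`:

    ```cpp
    #include <atomic>
    #include <cassert>

    int main() {
        std::atomic<int> counter{0};
        // With -march=armv8-a (Cortex-A72), GCC emits an exclusive
        // LDXR/STXR retry loop for this read-modify-write.
        // With -march=armv8.1-a it emits a single LDADDAL instruction,
        // one of the LSE atomics mentioned in the question.
        counter.fetch_add(1, std::memory_order_seq_cst);
        assert(counter.load() == 1);
        return 0;
    }
    ```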

    To answer the general question of how expensive a DMB is: it depends on the load of the system. The instruction itself takes almost no time to execute (a low single-digit number of cycles here or there), but the barrier prevents loads and/or stores after it, in program order, from being sent to the system. Both cache maintenance and natural cache activity count as stores. Cache coherency requests to or from the current core can also delay the barrier from finishing. On the Raspberry Pi, that's a quad-core cluster of Cortex-A72 sharing a single AXI bus to the memory system, so the activity of the other three cores plays a part in how long that barrier makes the current processor wait.

    It's rather hard to predict in that sense, as you need a holistic view of what is happening in the system, but it is easy to reason about and benchmark. If you do have a barrier that is particularly egregious, it might show up with a percentage next to it in a `perf` report on Linux. Note that this is merely an indication that, when perf sampled the state of the application, the program counter was lingering on that instruction; it is not a direct measurement of the barrier's cost.
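    One way to get a machine-specific number is a rough micro-benchmark (my own sketch, under the assumption that `steady_clock` is precise enough at this scale; absolute numbers only mean something when comparing an idle run against one with the FFT thread active):

    ```cpp
    #include <atomic>
    #include <chrono>
    #include <cstdio>

    int main() {
        constexpr int kIters = 1'000'000;
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < kIters; ++i)
            // Full fence: compiles to DMB ISH on ARM64.
            std::atomic_thread_fence(std::memory_order_seq_cst);
        auto t1 = std::chrono::steady_clock::now();
        double ns =
            std::chrono::duration<double, std::nano>(t1 - t0).count() / kIters;
        std::printf("avg fence cost: %.2f ns\n", ns);
        return 0;
    }
    ```

    Running this once on an idle system and once while the background FFT thread is loaded gives a direct before/after comparison of barrier cost on the actual hardware.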