Tags: performance, c++11, atomic

Performance comparison of atomic operations on different sizes


How does the performance of atomic operations on the processor's natural word size (4-byte or 8-byte) compare to that of operations on other sizes (2-byte or 1-byte)?

If I need to maintain an atomic boolean variable, I'm trying to figure out what the best practice is: use 1 byte to optimize for space, or 4/8 bytes to (potentially) optimize for performance.
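
For concreteness, the two candidates would look something like this (the flag names are just illustrative):

    #include <atomic>
    #include <cstdint>
    #include <cstdio>

    std::atomic<bool>     flag_small;   // typically 1 byte
    std::atomic<uint32_t> flag_wide;    // 4 bytes

    int main() {
        std::printf("small: %zu bytes, wide: %zu bytes\n",
                    sizeof(flag_small), sizeof(flag_wide));
    }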


Solution

  • See http://agner.org/optimize/ for lots of details.

    On x86, an array of 1-byte data should be good. It can be loaded with movzx (zero-extend) just as fast as with a plain mov.
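
    A minimal sketch of that case, assuming an array of 1-byte atomic flags (the names and sizes here are my own, purely illustrative):

        #include <atomic>
        #include <cstddef>
        #include <cstdint>

        // Dense 1-byte atomic flags: one 64-byte cache line holds 64 of them.
        std::atomic<uint8_t> flags[4096];

        bool is_set(size_t i) {
            // On x86 this compiles to a plain movzx byte load,
            // as cheap as the non-atomic version.
            return flags[i].load(std::memory_order_relaxed) != 0;
        }

        void mark(size_t i) {
            // A relaxed atomic store is just a byte mov, no lock prefix.
            flags[i].store(1, std::memory_order_relaxed);
        }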

    x86 has bit ops to support atomic bitfields, if you want to pack your data by another factor of 8. I'm not sure how well compilers will do at making efficient code for that case, though. Even a write-only operation requires a slow atomic RMW cycle for the byte holding the bit you want to write. (On x86, it would be a lock OR instruction, which is a full memory barrier. It's 8 uops on Intel Haswell, vs. 1 for a byte store. A factor of 19 in throughput.) This is probably still worth it if it means the difference between lots of cache misses and few cache misses, esp. if most of the access is read-only. (Reading a bit is fast, exactly the same as the non-atomic case.)
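
    Here's a sketch of what the bit-packed version looks like in portable C++, assuming 32-bit words as the underlying storage (names are my own):

        #include <atomic>
        #include <cstddef>
        #include <cstdint>

        std::atomic<uint32_t> bits[128];   // 4096 flags in 512 bytes

        bool test_bit(size_t i) {
            // Read-only access: an ordinary load plus shift/mask,
            // same cost as the non-atomic case.
            uint32_t w = bits[i / 32].load(std::memory_order_relaxed);
            return (w >> (i % 32)) & 1u;
        }

        void set_bit(size_t i) {
            // Writing even one bit needs an atomic RMW: fetch_or
            // compiles to `lock or` on x86, a full memory barrier
            // regardless of the relaxed ordering requested here.
            bits[i / 32].fetch_or(uint32_t(1) << (i % 32),
                                  std::memory_order_relaxed);
        }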

    2-byte (16bit) operations are potentially slow on x86, esp. on Intel CPUs. Intel instruction decoders slow down a lot when they have to decode an instruction with a 16bit immediate operand. This is the dreaded LCP stall from the operand-size prefix. (8b ops have a whole different opcode, and 32 vs. 64bit is selected by the REX prefix, which doesn't slow down the decoders.) So 16b is the odd-one-out, and you should be careful using it. Prefer to load 16b memory into 32b variables to avoid partial-register penalties and 16bit immediates when working with a temporary. (AMD CPUs aren't quite as efficient at handling movzx loads (it takes an ALU unit and an extra cycle of latency), but the savings in memory are still almost always worth it, for cache reasons.)
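
    As a sketch of that advice (hypothetical variable names): widen 16-bit atomic data into a 32-bit temporary as soon as you load it:

        #include <atomic>
        #include <cstdint>

        std::atomic<uint16_t> level16;

        bool above_limit() {
            // movzx load into a 32-bit register; the compare below is
            // then a 32-bit op, so no operand-size prefix and no
            // 16-bit-immediate LCP stall in the decoders.
            uint32_t tmp = level16.load(std::memory_order_relaxed);
            return tmp > 1000;
        }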


    32b is the "optimal" size to use for local scratch variables. No prefix is needed to select that size (increasing code density), and there won't be partial-register stalls or extra uops when using the full register again after using the low 8b. I believe this is the purpose of the int_fast32_t type, but on x86-64 Linux that type is unfortunately 64bit.
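
    You can check what your platform actually does; per the above, on x86-64 Linux (glibc) this prints 8 for int_fast32_t:

        #include <cstdint>
        #include <cstdio>

        int main() {
            std::printf("sizeof(std::int_fast32_t) = %zu\n",
                        sizeof(std::int_fast32_t));
            std::printf("sizeof(std::int32_t)      = %zu\n",
                        sizeof(std::int32_t));
        }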