Tags: performance, x86, arm, cpu-architecture, cpu-cache

Are there any modern CPUs where a cached byte store is actually slower than a word store?


It's a common claim that a byte store into cache may result in an internal read-modify-write cycle, or otherwise hurt throughput or latency vs. storing a full register.

But I've never seen any examples. No x86 CPUs are like this, and I think all high-performance CPUs can directly modify any byte in a cache-line, too. Are some microcontrollers or low-end CPUs different, if they have cache at all?

(I'm not counting word-addressable machines, or Alpha which is byte addressable but lacks byte load/store instructions. I'm talking about the narrowest store instruction the ISA natively supports.)

In my research while answering Can modern x86 hardware not store a single byte to memory?, I found that the reasons Alpha AXP omitted byte stores presumed they'd be implemented as true byte stores into cache, not an RMW update of the containing word. (So it would have made ECC protection for L1d cache more expensive, because it would need byte granularity instead of 32-bit).

I'm assuming that word-RMW during commit to L1d cache wasn't considered as an implementation option for other more-recent ISAs that do implement byte stores.

All modern architectures (other than early Alpha) can do true byte loads/stores to uncacheable MMIO regions (not RMW cycles), which is necessary for writing device drivers for devices that have adjacent byte I/O registers. (e.g. with external enable/disable signals to specify which parts of a wider bus hold the real data, like the 2-bit TSIZ (transfer size) on this ColdFire CPU/microcontroller, or like PCI / PCIe single byte transfers, or like DDR SDRAM control signals that mask selected bytes.)

Maybe doing an RMW cycle in cache for byte stores would be something to consider for a microcontroller design, even though it's not for a high-end superscalar pipelined design aimed at SMP servers / workstations like Alpha?

I think this claim might come from word-addressable machines. Or from unaligned 32-bit stores requiring multiple accesses on many CPUs, and people incorrectly generalizing from that to byte stores.


Just to be clear, I expect that a byte store loop to the same address would run at the same cycles per iteration as a word store loop. So for filling an array, 32-bit stores can go up to 4x faster than 8-bit stores. (Maybe less if 32-bit stores saturate memory bandwidth but 8-bit stores don't.) But unless byte stores have an extra penalty, you won't get more than a 4x speed difference (or whatever the word width is).
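For example, a pair of hypothetical fill loops like these (fill_bytes / fill_words are my names, not part of the benchmark below; volatile keeps each store at its natural width, the same trick the later test functions use) should differ by at most that factor of 4:

#include <stddef.h>

// Sketch of the expected <=4x gap when filling a buffer: one 1-byte store per
// iteration vs. one 4-byte store per iteration.  volatile stops the compiler
// from widening or vectorizing the stores.
void fill_bytes(volatile unsigned char *buf, size_t len) {
    for (size_t i = 0; i < len; i++)
        buf[i] = 0x55;
}

void fill_words(volatile unsigned int *buf, size_t len_words) {
    for (size_t i = 0; i < len_words; i++)
        buf[i] = 0x55555555;
}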

And I'm talking about asm. A good compiler will auto-vectorize a byte or int store loop in C and use wider stores or whatever is optimal on the target ISA, if they're contiguous.

(And store coalescing in the store buffer could also result in wider commits to L1d cache for contiguous byte-store instructions, so that's another thing to watch out for when microbenchmarking)

; x86-64 NASM syntax
mov   rdi, rsp
; RDI holds a 32-bit aligned address
mov   ecx, 1000000000
.loop:                      ; do {
    mov   byte [rdi], al
    mov   byte [rdi+2], dl     ; store two bytes in the same dword
      ; no pointer increment, this is the same 32-bit dword every time
    dec   ecx
    jnz   .loop             ; }while(--ecx != 0)


    mov   eax,60
    xor   edi,edi
    syscall         ; x86-64 Linux sys_exit(0)

Or a loop over an 8kiB array like this, storing 1 byte or 1 word out of every 8 bytes. (The 8kiB figure assumes a C implementation with sizeof(unsigned int)==4 and CHAR_BIT==8, but it should compile to comparable functions on any C implementation, with only a minor bias if sizeof(unsigned int) isn't a power of 2.) ASM on Godbolt for a few different ISAs, with either no unrolling, or the same amount of unrolling for both versions.

// volatile defeats auto-vectorization
void byte_stores(volatile unsigned char *arr) {
    for (int outer=0 ; outer<1000 ; outer++)
        for (int i=0 ; i< 1024 ; i++)      // stride over 1024 * 2*sizeof(int) = 8kiB of chars
            arr[i*2*sizeof(unsigned) + 1] = 123;    // touch one byte of every 2 words
}

// volatile to defeat auto-vectorization: x86 could use AVX2 vpmaskmovd
void word_stores(volatile unsigned int *arr) {
    for (int outer=0 ; outer<1000 ; outer++)
        for (int i=0 ; i<(1024 / sizeof(unsigned)) ; i++)  // same number of chars
            arr[i*2 + 0] = 123;       // touch every other int
}

Adjusting sizes as necessary, I'd be really curious if anyone can point to a system where word_stores() is faster than byte_stores(). (If actually benchmarking, beware of warm-up effects like dynamic clock speed, and the first pass triggering TLB misses and cache misses.)
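If anyone wants to actually run it, a minimal harness along these lines (hypothetical, not part of the question; assumes POSIX clock_gettime and the two functions above in the same file) would do, with a throwaway warm-up call first:

#include <stdio.h>
#include <time.h>

void byte_stores(volatile unsigned char *arr);
void word_stores(volatile unsigned int *arr);

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    static unsigned int arr[2048];               // 8kiB, word-aligned

    byte_stores((volatile unsigned char *)arr);  // warm-up: page faults, TLB/cache misses, clock ramp
    double t0 = seconds();
    byte_stores((volatile unsigned char *)arr);
    double t1 = seconds();
    word_stores(arr);
    double t2 = seconds();
    printf("byte_stores: %.3f s   word_stores: %.3f s\n", t1 - t0, t2 - t1);
    return 0;
}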

Or, if C compilers for ancient platforms don't exist or generate sub-optimal code that doesn't bottleneck on store throughput, then any hand-crafted asm that shows the effect is fine.

Any other way of demonstrating a slowdown for byte stores is fine; I don't insist on strided loops over arrays or spamming writes within one word.

I'd also be fine with detailed documentation about CPU internals, or CPU cycle timing numbers for different instructions. I'm leery of optimization advice or guides that could be based on this claim without having tested, though.

  • Any still-relevant CPU or microcontroller where cached byte stores have an extra penalty?
  • Any still-relevant CPU or microcontroller where un-cacheable byte stores have an extra penalty?
  • Any historical CPUs that are no longer relevant (with write-back or write-through caches, or without caches) where either of the above is true? What's the most recent example?

e.g. is this the case on an ARM Cortex-A?? or Cortex-M? Any older ARM microarchitecture? Any MIPS microcontroller or early MIPS server/workstation CPU? Anything other random RISC like PA-RISC, or CISC like VAX or 486? (CDC6600 was word-addressable.)

Or construct a test-case involving loads as well as stores, e.g. showing word-RMW from byte stores competing with load throughput.

(I'm not interested in showing that store-forwarding from byte stores to word loads is slower than word->word, because it's normal that SF only works efficiently when a load is fully contained in the most recent store to touch any of the relevant bytes. But something that showed byte->byte forwarding being less efficient than word->word SF would be interesting, maybe with bytes that don't start at a word boundary.)
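One way to poke at that would be a loop-carried dependency through memory, something like this sketch (byte_forwarding / word_forwarding are hypothetical names; buf[1] deliberately puts the byte off a word boundary):

// Sketch of a store-forwarding latency test: each iteration's reload depends on
// the store from the same iteration, and the next store depends on that reload,
// so the loop time is dominated by store->load forwarding latency, not throughput.
unsigned char byte_forwarding(volatile unsigned char *buf, long iters) {
    unsigned char x = 1;
    for (long i = 0; i < iters; i++) {
        buf[1] = x;        // 1-byte store, not at a word boundary
        x = buf[1] + 1;    // reload the same byte: carried dependency through memory
    }
    return x;
}

unsigned int word_forwarding(volatile unsigned int *buf, long iters) {
    unsigned int x = 1;
    for (long i = 0; i < iters; i++) {
        buf[0] = x;        // aligned 4-byte store
        x = buf[0] + 1;    // reload the same word
    }
    return x;
}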


(I didn't mention byte loads because that's generally easy: access a full word from cache or RAM and then extract the byte you want. That implementation detail is indistinguishable other than for MMIO, where CPUs definitely don't read the containing word.)

On a load/store architecture like MIPS, working with byte data just means you use lb or lbu to load and zero or sign-extend it, then store it back with sb. (If you need truncation to 8 bits between steps in registers, then you might need an extra instruction, so local vars should usually be register sized. Unless you want the compiler to auto-vectorize with SIMD with 8-bit elements, then often uint8_t locals are good...) But anyway, if you do it right and your compiler is good, it shouldn't cost any extra instructions to have byte arrays.
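To make that concrete, a trivial pair like this (hypothetical functions, not from the question) should compile to the same instruction count on MIPS, with only the access width differing (something like lbu / addiu / sb for the byte version vs. lw / addiu / sw for the int version):

// Incrementing one element of a byte array vs. an int array: same number of
// instructions on a load/store ISA, just narrower loads/stores for the byte case.
void inc_byte_elem(unsigned char *p) { p[0] = p[0] + 1; }
void inc_int_elem (unsigned int  *p) { p[0] = p[0] + 1; }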

I notice that gcc has sizeof(uint_fast8_t) == 1 on ARM, AArch64, x86, and MIPS. But IDK how much stock we can put in that. The x86-64 System V ABI defines uint_fast32_t as a 64-bit type on x86-64. If they're going to do that (instead of 32-bit, which is x86-64's default operand-size), uint_fast8_t should also be a 64-bit type. Maybe to avoid zero-extension when it's used as an array index, if it was passed as a function arg in a register? (Zero-extension would be free anyway if you had to load it from memory.)
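(If you want to check what a given implementation picked, something like this one-off is enough; the sizes above are what gcc's headers give on those targets, and other compilers / C libraries may differ.)

#include <stdint.h>
#include <stdio.h>

// Print the widths the C implementation chose for the "fast" integer types.
int main(void) {
    printf("uint_fast8_t: %zu, uint_fast16_t: %zu, uint_fast32_t: %zu\n",
           sizeof(uint_fast8_t), sizeof(uint_fast16_t), sizeof(uint_fast32_t));
    return 0;
}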


Solution

  • My guess was wrong. Modern x86 microarchitectures really are different in this way from some (most?) other ISAs.

    There can be a penalty for cached narrow stores even on high-performance non-x86 CPUs. The reduction in cache footprint can still make int8_t arrays worth using, though. (And on some ISAs like MIPS, not needing to scale an index for an addressing mode helps).

    Merging / coalescing in the store buffer between byte store instructions to the same word before actual commit to L1d can also reduce or remove the penalty. (x86 sometimes can't do as much of this because its strong memory model requires all stores to commit in program order.)


    ARM's documentation for Cortex-A15 MPCore (from ~2012) says it uses 32-bit ECC granularity in L1d, and does in fact do a word-RMW for narrow stores to update the data.

    The L1 data cache supports optional single bit correct and double bit detect error correction logic in both the tag and data arrays. The ECC granularity for the tag array is the tag for a single cache line and the ECC granularity for the data array is a 32-bit word.

    Because of the ECC granularity in the data array, a write to the array cannot update a portion of a 4-byte aligned memory location because there is not enough information to calculate the new ECC value. This is the case for any store instruction that does not write one or more aligned 4-byte regions of memory. In this case, the L1 data memory system reads the existing data in the cache, merges in the modified bytes, and calculates the ECC from the merged value. The L1 memory system attempts to merge multiple stores together to meet the aligned 4-byte ECC granularity and to avoid the read-modify-write requirement.

    (When they say "the L1 memory system", I think they mean the store buffer, if you have contiguous byte stores that haven't yet committed to L1d.)

    Note that the RMW is atomic, and only involves the exclusively-owned cache line being modified. This is an implementation detail that doesn't affect the memory model. So my conclusion on Can modern x86 hardware not store a single byte to memory? is still (probably) correct that x86 can, and so can every other ISA that provides byte store instructions.
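    As a rough mental model (just a sketch of the mechanism the Cortex-A15 TRM text above describes, not a real hardware description; compute_ecc() is a placeholder for whatever SEC-DED code the hardware actually uses), committing a byte store into a data array with 32-bit ECC granules looks something like this:

    #include <stdint.h>

    // Conceptual model of a byte-store commit into an L1d data array with
    // 32-bit ECC granules.  The cache line is already exclusively owned, so
    // the read-merge-write is atomic as far as other cores can observe.
    struct ecc_granule {
        uint32_t data;
        uint8_t  ecc;                       // ECC bits covering the whole 32-bit word
    };

    static uint8_t compute_ecc(uint32_t w)  // placeholder: real HW uses a SEC-DED code
    {
        w ^= w >> 16; w ^= w >> 8; w ^= w >> 4; w ^= w >> 2; w ^= w >> 1;
        return (uint8_t)(w & 1);
    }

    void commit_byte_store(struct ecc_granule *g, unsigned byte_in_word, uint8_t value)
    {
        uint32_t word = g->data;                         // read the existing 32-bit word
        word &= ~(0xFFu << (8 * byte_in_word));          // clear the target byte
        word |= (uint32_t)value << (8 * byte_in_word);   // merge in the new byte
        g->data = word;                                  // write back the merged word
        g->ecc  = compute_ecc(word);                     // recompute ECC over the full word
    }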


    Cortex-A15 MPCore is a 3-way out-of-order execution CPU, so it's not a minimal power / simple ARM design, yet they chose to spend transistors on OoO exec but not efficient byte stores.

    Presumably without the need to support efficient unaligned stores (which x86 software is more likely to assume / take advantage of), having slower byte stores was deemed worth it for the higher reliability of ECC for L1d without excessive overhead.

    Cortex-A15 is probably not the only, and not the most recent, ARM core to work this way.


    Other examples (found by @HadiBrais in comments):

    1. Alpha 21264 (see Table 8-1 of Chapter 8 of this doc) has 8-byte ECC granularity for its L1d cache. Narrower stores (including 32-bit) result in a RMW when they commit to L1d, if they aren't merged in the store buffer first. The doc explains full details of what L1d can do per clock. And specifically documents that the store buffer does coalesce stores.

    2. PowerPC RS64-II and RS64-III (see the section on errors in this doc). According to this abstract, L1 of the RS/6000 processor has 7 bits of ECC for each 32-bits of data.

    Alpha was aggressively 64-bit from the ground up, so 8-byte granularity makes some sense, especially if the RMW cost can mostly be hidden / absorbed by the store buffer. (e.g. maybe the normal bottlenecks were elsewhere for most code on that CPU; its multi-ported cache could normally handle 2 operations per clock.)

    POWER / PowerPC64 grew out of 32-bit PowerPC and probably cares about running 32-bit code with 32-bit integers and pointers. (So more likely to do non-contiguous 32-bit stores to data structures that couldn't be coalesced.) So 32-bit ECC granularity makes a lot of sense there.