Tags: caching, cpu, intel, cpu-architecture, cpu-cache

What does a cache line in a CPU consist of besides the usual tags, data, and dirty+valid bits?


I've been doing some research on caching recently, and I'm curious: what exactly makes up a cache line?

A CPU often has an L1 data cache, an L1 instruction cache, a unified L2 cache, and a last-level cache (LLC).

[figure: cache hierarchy showing L1d, L1i, L2, and last-level cache]

In the L1 cache, each cache line holds the data itself, often 64 bytes. There is a tag field, which is compared against the address on lookup, and a dirty bit, which records whether the data in the line has been modified.

[figure: cache line layout with valid, dirty, tag, and data fields]

With multiple cores, a cache-coherence protocol such as MESI needs to be maintained.

Assuming there are n cores, each LLC cache line needs n bits to record which cores hold a copy of it.

All of this is standard textbook material. But I'm curious: does a cache line contain only these bits?

Are there other bitfields in the cache line?


Solution

  • There are normally some bits for pseudo-LRU to help make a better choice of which line to evict when necessary. Full LRU would be more expensive but not much better, so it's usually not done, especially for caches that are 8-way associative or more. (See Why Bit-PLRU is different from LRU, and especially Andreas Abel's answer on What cache invalidation algorithms are used in actual CPU caches? with some practical info on Intel CPUs.)

    Intel since Ivy Bridge has used an adaptive replacement policy in their L3 cache, but that's still just some bits per cache line that are updated according to some rules.
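
    To make the bookkeeping concrete, here is a minimal tree-PLRU sketch in C for one 8-way set (the layout and names are my own illustration, not any specific CPU's): 7 bits form a binary tree whose nodes each point toward the "colder" half of the set.

```c
#include <stdint.h>
#include <stdio.h>

/* One 8-way set's tree-PLRU state: 7 bits arranged as a binary tree
 * (node 0 = root; children of node i are 2i+1 and 2i+2). */
typedef struct { uint8_t bits; } plru8_t;

/* On a hit or fill of `way`, flip the bits on the root-to-leaf path
 * so they point away from the way just touched. */
static void plru_touch(plru8_t *s, unsigned way) {
    unsigned node = 0;
    for (unsigned level = 0; level < 3; level++) {
        unsigned bit = (way >> (2 - level)) & 1;  /* child holding `way` */
        if (bit) s->bits &= ~(1u << node);        /* point at other child */
        else     s->bits |=  (1u << node);
        node = 2 * node + 1 + bit;
    }
}

/* Pick a victim by walking the tree in the direction the bits point. */
static unsigned plru_victim(const plru8_t *s) {
    unsigned node = 0, way = 0;
    for (unsigned level = 0; level < 3; level++) {
        unsigned bit = (s->bits >> node) & 1;
        way = (way << 1) | bit;
        node = 2 * node + 1 + bit;
    }
    return way;
}

int main(void) {
    plru8_t set = {0};
    plru_touch(&set, 3);
    plru_touch(&set, 5);
    printf("victim = way %u\n", plru_victim(&set));  /* an untouched way */
    return 0;
}
```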


    Caches are normally protected with ECC, although rumour has it that Intel only uses parity in their L1d caches to keep overhead down while allowing efficient unaligned and narrow stores without an RMW cycle to commit to cache. (Can modern x86 hardware not store a single byte to memory?)

    Outer caches can treat the whole 64-byte line as a single ECC granule to reduce overhead, since there is no need to write only part of a line. L1i is read-only and can just re-fetch from elsewhere if parity or ECC finds a problem. When L1d caches do use ECC, 4-byte or 8-byte granules are typical, with SECDED (single-error correct / double-error detect) costing 7 bits per 32-bit granule of data (22% overhead), vs. 5 bits per 8-bit granule (62.5%). 64-bit granules reduce the overhead further (8 bits, 12.5%).
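
    The granule arithmetic is just the extended-Hamming formula; this little program reproduces those overhead numbers (a sketch, assuming standard SECDED built as a SEC Hamming code plus one extra parity bit for double-error detection):

```c
#include <stdio.h>

/* SEC Hamming needs r check bits with 2^r >= k + r + 1 over k data
 * bits; SECDED adds one more parity bit for double-error detection. */
static unsigned secded_bits(unsigned k) {
    unsigned r = 1;
    while ((1u << r) < k + r + 1) r++;
    return r + 1;  /* +1 for the DED parity bit */
}

int main(void) {
    const unsigned granules[] = {8, 16, 32, 64};
    for (unsigned i = 0; i < 4; i++) {
        unsigned k = granules[i], c = secded_bits(k);
        printf("%2u-bit granule: %u check bits (%.1f%% overhead)\n",
               k, c, 100.0 * c / k);   /* 32 -> 7 bits, ~22% */
    }
    return 0;
}
```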

    Having caches use ECC costs overhead, but without it you have a lot of cells holding the only copy of some data, running at minimal voltages; what could go wrong?


    Atomic RMW operations like lock cmpxchg [rdi], eax take a "cache lock" on the line involved, unless the operation is split across cache lines, in which case it needs to globally lock memory. A "cache lock" means the line won't reply to MESI requests to share it until another uop "unlocks" it, so it stays exclusively owned for the duration of the atomic RMW. From the PoV of any other core, the operation on that line was atomic. (And it's also a full memory barrier, so nothing at all tricky can ever be observed, unlike on weakly-ordered ISAs.)
    Tracking whether a line is locked or not might involve an extra bit of metadata. Or if only one line can be locked at once, maybe just a single "register" to track which line (if any) is currently locked. That probably makes more sense than a bit in every line of L1d, so never mind this!
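
    For reference, this is the kind of atomic RMW that takes such a cache lock: on x86, compilers typically turn this C11 CAS loop into lock cmpxchg. (A sketch; the clamped-add operation itself is just an arbitrary example.)

```c
#include <stdatomic.h>
#include <stdio.h>

/* The hardware holds the line exclusively ("cache lock") for each RMW,
 * making it atomic from the point of view of other cores. */
static _Atomic int counter = 0;

void add_clamped(int delta, int max) {
    int old = atomic_load(&counter);
    int desired;
    do {
        desired = old + delta;
        if (desired > max) desired = max;
        /* On failure, `old` is reloaded with the current value. */
    } while (!atomic_compare_exchange_weak(&counter, &old, desired));
}

int main(void) {
    add_clamped(5, 10);
    add_clamped(7, 10);
    printf("counter = %d\n", atomic_load(&counter));  /* 10 */
    return 0;
}
```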

    Speaking of MESI, a line needs to track its MESIF or MOESI state, not just dirty/valid, unless this is a single-processor system whose cache doesn't need to maintain coherency with any others. For classic MESI's 4 states, including Invalid, your 2 bits for Valid + Dirty are already sufficient. But Intel's MESIF / AMD's MOESI introduce an extra state which might take an extra bit. (Which cache-coherence-protocol does Intel and AMD use? suggests that tracking the "forwarded" state might not actually take an extra bit, though, at least not in L1d / L2. See also What cache coherence solution do modern x86 CPUs use?).
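
    As a back-of-the-envelope encoding (purely illustrative, not any real CPU's layout): the four MESI states fit in the same 2 bits as valid+dirty, while a naive fifth state pushes the field to 3 bits.

```c
#include <stdio.h>

/* Illustrative only: MESI's four states reuse the valid+dirty pair as
 * a 2-bit state field; a fifth state (MESIF Forward / MOESI Owned)
 * no longer fits in 2 bits. */
enum line_state {
    ST_INVALID   = 0x0,  /* valid=0 */
    ST_SHARED    = 0x1,  /* valid=1, dirty=0 */
    ST_EXCLUSIVE = 0x2,  /* the otherwise-unused combination */
    ST_MODIFIED  = 0x3,  /* valid=1, dirty=1 */
    ST_OWNED     = 0x4,  /* 5th state: needs a 3rd bit */
};

int main(void) {
    printf("MESI max code:  %d (fits in 2 bits)\n", ST_MODIFIED);
    printf("MOESI max code: %d (needs 3 bits)\n", ST_OWNED);
    return 0;
}
```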

    Your book is also showing extra L3 metadata to track which cores might have a copy of the line. Only one core can ever have a modified copy, and if cache-to-cache transfers must go through (or at least update) L3, that metadata stays in sync. For sending out invalidates, yes, it is helpful to filter by which cores could possibly have a copy of the line instead of broadcasting to all cores.
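
    Here is a sketch of the per-line presence bitmap that figure implies (field names are hypothetical): one bit per core, consulted to filter invalidates instead of broadcasting them.

```c
#include <stdint.h>
#include <stdio.h>

#define NCORES 8

/* Hypothetical LLC line metadata: bit i of `present` set means core i's
 * private caches may hold a copy of this line. */
typedef struct {
    uint64_t tag;
    uint8_t  present;
} llc_line_meta;

static void on_core_fetch(llc_line_meta *m, unsigned core) {
    m->present |= 1u << core;
}

/* A write by `writer` invalidates only cores whose bit is set. */
static void send_invalidates(llc_line_meta *m, unsigned writer) {
    for (unsigned c = 0; c < NCORES; c++)
        if (c != writer && (m->present & (1u << c)))
            printf("invalidate line in core %u\n", c);
    m->present = 1u << writer;  /* only the writer keeps a copy */
}

int main(void) {
    llc_line_meta m = { .tag = 0x1234, .present = 0 };
    on_core_fetch(&m, 0);
    on_core_fetch(&m, 3);
    send_invalidates(&m, 0);    /* invalidates core 3's copy only */
    return 0;
}
```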

    Snoop filters can be built separately from L3 tags, so you're not limited by L3 associativity in terms of which sets of lines the L2 / L1d caches on each core can be caching. (Skylake-Xeon and later do this, with 1 MiB L2 caches and a total L3 size of only about 1.3 MiB per core, no longer inclusive like Intel had been doing since Nehalem, the first-gen i7. Current-generation "client" Intel CPUs, i.e. non-Xeon, as far as I know still use an inclusive L3 and a ring bus, not the mesh interconnect of Xeon Scalable. See the cache-coherence link above.)


    Nothing else comes to mind, but I wouldn't be surprised if there's something I'm not thinking of.

    I think PCID (process-context ID) stuff is just for TLB entries, since caches are essentially physically addressed. (VIPT is just a speed boost since Intel makes their L1 caches associative enough that both synonyms and homonyms are impossible without the OS needing to do page colouring.)
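
    The arithmetic behind that claim, assuming Intel's typical 32 KiB / 8-way / 64-byte-line L1d sizing: all index bits fall inside the 12-bit page offset, so the virtual index equals the physical index and no aliasing is possible.

```c
#include <stdio.h>

/* Why VIPT is "free" at this geometry: with 32 KiB, 8 ways, and
 * 64-byte lines, index+offset bits fit within the page offset. */
int main(void) {
    unsigned size = 32 * 1024, ways = 8, line = 64;
    unsigned sets = size / (ways * line);        /* 64 sets */
    unsigned offset_bits = __builtin_ctz(line);  /* 6 */
    unsigned index_bits  = __builtin_ctz(sets);  /* 6 */
    printf("index+offset = %u bits (page offset = 12)\n",
           offset_bits + index_bits);            /* 12 <= 12: no aliasing */
    return 0;
}
```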

    In Pentium 4 CPUs, when hyperthreading was new, there was a mode where the two hyperthreads didn't share the same lines in L1d cache, so lines were tagged with a one-bit logical-core ID. That was basically a fallback in case a design bug turned up with how two logical cores share the same line, to be enabled via microcode update, but I think current CPUs don't have it. See "shared mode" in What will be used for data exchange between threads are executing on one Core with HT? - current CPUs only support "shared mode", not the slow mode where they can't both access a hot cache line for the same address.


    Optional extras

    On Intel Haswell/Skylake, there might be some extra bits to track TSX transaction status (2 bits: read-set and write-set) in L1d, or maybe that would be a separate structure that can be scanned easily. The new data during a transaction has to go somewhere, and it turns out Intel picked L1d and L2. (https://www.realworldtech.com/haswell-tm-alt/ discusses two alternatives before it was known that cache was the mechanism, not the memory-order-buffer. https://www.realworldtech.com/haswell-tm/3/ has some mention of tag bits being involved). I guess written cache lines might have to be written back (cleaned) at least to L2 before the transaction starts, so on abort the write-set lines can just be invalidated, but I didn't re-check those articles.
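
    For what it's worth, this is what exercising those read-set/write-set bits looks like from software, via the RTM intrinsics. (A sketch: it needs -mrtm and TSX-enabled hardware, which many current CPUs no longer have; the variables and fallback are arbitrary examples.)

```c
#include <immintrin.h>
#include <stdio.h>

/* The lines read and written inside the transaction are what those
 * per-line read-set / write-set bits in L1d would track; evicting a
 * tracked line (or a conflicting access from another core) aborts. */
static int shared_a = 1, shared_b = 0;

void move_unit(void) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        shared_a--;      /* both lines join the write set */
        shared_b++;
        _xend();         /* commit: the write set becomes visible atomically */
    } else {
        /* Fallback path: real code would take a lock here. */
        printf("transaction aborted, status %#x\n", status);
    }
}

int main(void) {
    move_unit();
    printf("a=%d b=%d\n", shared_a, shared_b);
    return 0;
}
```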

    L1i cache might mark instruction boundaries - some CPUs (especially AMD) did this, especially before introducing a uop cache. Most x86 machine code doesn't have overlapping instructions, such as a jump backwards into the middle of an instruction that previously executed, so instead of pre-decode redoing this work on every fetch, the boundary marks can be kept in L1i.
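
    A sketch of what such predecode metadata could look like (the layout is hypothetical): one "instruction starts here" bit per byte of the line, so fetch can find boundaries without rescanning variable-length x86 code.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-line predecode metadata for a 64-byte L1i line. */
typedef struct {
    uint8_t  bytes[64];   /* the cached instruction bytes */
    uint64_t insn_start;  /* bit i set => bytes[i] begins an instruction */
} l1i_line;

/* Next instruction boundary at or after byte `from` (from < 64). */
static int next_boundary(const l1i_line *l, unsigned from) {
    uint64_t later = l->insn_start >> from;
    return later ? (int)(from + __builtin_ctzll(later)) : -1;
}

int main(void) {
    l1i_line line = { .insn_start = 0 };
    line.insn_start |= 1ull << 0;   /* pretend decode found starts at 0, 3, 10 */
    line.insn_start |= 1ull << 3;
    line.insn_start |= 1ull << 10;
    printf("next boundary after byte 4: %d\n", next_boundary(&line, 4)); /* 10 */
    return 0;
}
```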

    Some AMD CPUs do way-prediction with micro-tags, extra bits associated with each cache line. This saves power in the common case.
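
    A toy model of way prediction with micro-tags (the hash and field widths are made up for illustration): probe only the way whose small hashed tag matches, then confirm with the full tag, since micro-tags can collide where full tags cannot.

```c
#include <stdint.h>
#include <stdio.h>

#define WAYS 8

/* Each way keeps a tiny hash ("micro-tag") of its full tag, so lookup
 * can read one predicted data array instead of all eight in parallel. */
typedef struct {
    uint64_t tag[WAYS];   /* full tags */
    uint8_t  utag[WAYS];  /* hashed micro-tags, one per way */
} set_meta;

static uint8_t utag_hash(uint64_t tag) {
    return (uint8_t)(tag ^ (tag >> 8) ^ (tag >> 16));
}

static int lookup(const set_meta *s, uint64_t tag) {
    uint8_t u = utag_hash(tag);
    for (int w = 0; w < WAYS; w++)
        if (s->utag[w] == u && s->tag[w] == tag)  /* predict, then confirm */
            return w;
    return -1;  /* miss (a micro-tag collision would be handled as a replay) */
}

int main(void) {
    set_meta s = {0};
    s.tag[5]  = 0xABCDE;
    s.utag[5] = utag_hash(0xABCDE);
    printf("hit in way %d\n", lookup(&s, 0xABCDE));
    return 0;
}
```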