Tags: cpu, intel, cpu-architecture, cpu-cache, gem5

Is the L1-Dcache the innermost data cache, and is the DSB also a cache that can be simulated by gem5?


  1. I wonder whether the L1-Dcache is the innermost cache that data comes from. I ask because I know that for the i-cache there is the DSB, which is even closer to the CPU and could be seen as an L0 i-cache.

  2. I am also interested in which hardware changes could influence the DSB's performance. For a cache there are parameters such as size and associativity; is the DSB also just a cache whose performance is influenced by those factors?

  3. If so, can I simulate the results using gem5? I know that with gem5 I can configure the L1 instruction cache and observe its performance. How could the same be done for the DSB in gem5?


Solution

  • I wonder whether the L1-Dcache is the innermost cache that data comes from

    Yes, or the store buffer. Globally Invisible load instructions explains how partial store-to-load forwarding can let a core load a dword value that was never globally visible, so no other core could have loaded it.


    The DSB (uop cache) is a cache, but it doesn't cache machine code. It caches the result of decoding x86 machine code into uops.

    It has various limitations, like not using more than 3 "lines" for uops from the same 32-byte block of x86 machine code, so modeling it is not as simple as just size / associativity. e.g. each way (aka line) can hold up to 6 uops, but ends with an unconditional (or predicted-taken) branch uop. And all the uops from a multi-uop instruction have to go in the same line.
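
    To make those packing rules concrete, here is a rough sketch in Python. It models only the constraints just described (6 uops per line, at most 3 lines per aligned 32-byte block, all uops of one instruction in the same line); the real hardware has further rules around branches and microcoded instructions, so treat this as an illustration rather than a faithful model.

    ```python
    # Simplified sketch of the DSB packing rules described above.
    # Input: uop counts for the instructions in one aligned 32-byte block.
    # Output: the uop-cache lines used, or None if the block needs more
    # than 3 lines and would run from the legacy decoders instead.
    LINE_CAPACITY = 6      # uops per uop-cache line (way)
    MAX_LINES_PER_32B = 3  # lines per aligned 32-byte code block

    def pack_block(uop_counts):
        lines = [[]]
        for n in uop_counts:
            if n > LINE_CAPACITY:
                return None  # microcoded instruction; not modeled here
            # All uops of one instruction must land in the same line.
            if sum(lines[-1]) + n > LINE_CAPACITY:
                lines.append([])
            lines[-1].append(n)
        if len(lines) > MAX_LINES_PER_32B:
            return None
        return lines

    # Eight 2-uop instructions need 3 lines: [[2, 2, 2], [2, 2, 2], [2, 2]]
    print(pack_block([2] * 8))
    # Nineteen 1-uop instructions don't fit in 3 lines of 6: None
    print(pack_block([1] * 19))
    ```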

    The number of fused-domain uops from each x86 instruction depends on exactly which instruction it is; see https://uops.info/, but note that un-lamination means some instructions take more uops in the issue/rename stage and ROB than they do in the decoders and uop cache. (Micro fusion and addressing modes)

    Agner Fog's microarch guide has some detailed testing results (https://agner.org/optimize/), and see also https://www.realworldtech.com/sandy-bridge/4/

    The basic parameters of Intel's uop cache, as described in the Sandybridge section of Agner's microarch guide, are:

    The µop cache is organized as 32 sets x 8 ways x 6 µops, totaling a maximum capacity of 1536 µops. It can allocate a maximum of 3 lines of 6 µops each for each aligned and contiguous 32-bytes block of code.

    AFAIK, this geometry remained unchanged from SnB through Skylake; Ice Lake enlarged the uop cache (to 2304 uops).
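
    As a sanity check on those numbers, the arithmetic for the SnB-through-Skylake geometry is easy to reproduce. The set-index calculation below is my assumption, not something from Agner's guide: it assumes sets are selected by the virtual-address bits just above the 32-byte block offset, which is at least consistent with the 3-lines-per-32-byte-block behavior.

    ```python
    SETS, WAYS, UOPS_PER_LINE = 32, 8, 6
    print(SETS * WAYS * UOPS_PER_LINE)  # 1536 uops total capacity

    # Assumed indexing: the set is chosen by virtual-address bits [9:5],
    # i.e. the bits just above the 32-byte block offset, so each aligned
    # 32-byte block maps to one set of 8 ways (of which it may use 3).
    def dsb_set(virt_addr):
        return (virt_addr >> 5) & (SETS - 1)

    print(dsb_set(0x401034))  # -> 1
    ```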

    The L1i cache is inclusive of the uop cache. The uop cache is virtually addressed, so TLB lookups aren't needed. But it has to be evicted on TLB invalidation as well, I guess. (That's not a huge problem because the legacy decoders are quite good; Sandybridge-family avoided the problems P4 had with slow decoding and with trying to use its trace cache in place of a normal L1i.)

    Note that AMD's Zen microarchitecture family also uses a uop cache. They don't call it a DSB, and it presumably has some differences from Intel's.


  • I am also interested in which hardware changes could influence the DSB's performance

    Skylake increased the bandwidth of uop-cache -> IDQ from 4 to 6 uops per cycle. So even in high-throughput code, the uop-cache can "catch up" after bubbles partially drain the IDQ.

    It can still only read 1 uop-cache line per cycle, though. So, for example, on a Skylake where microcode updates disabled the loop buffer (LSD), a tiny loop that would normally run at 1 cycle per iteration can slow down to 2 cycles per iteration if it's split across a 32-byte boundary, because its uops then sit in 2 separate uop-cache lines (with, say, 1 or 2 uops in each line).
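
    You can observe which path is actually delivering uops on real hardware with performance counters. The sketch below drives Linux perf from Python; the event names (idq.dsb_uops, idq.mite_uops, lsd.uops) are the Intel Sandybridge-family ones and may differ or be absent on other CPUs (check `perf list`), and ./workload is a placeholder for whatever loop you're testing.

    ```python
    # Sketch: compare uop delivery paths (DSB vs. legacy decode vs. LSD)
    # using Linux perf counters. Assumes an Intel CPU that exposes these
    # events; './workload' is a placeholder binary.
    import subprocess

    EVENTS = ",".join([
        "idq.dsb_uops",   # uops delivered from the uop cache (DSB)
        "idq.mite_uops",  # uops from the legacy decoders (MITE)
        "lsd.uops",       # uops replayed from the loop buffer (LSD)
    ])

    result = subprocess.run(
        ["perf", "stat", "-e", EVENTS, "--", "./workload"],
        capture_output=True, text=True,
    )
    print(result.stderr)  # perf stat writes its counts to stderr
    ```

    For the split-loop case above, you'd expect the uops to show up under idq.dsb_uops rather than lsd.uops on a chip with the LSD disabled.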

    Despite that 1-line-per-cycle limit, Haswell can sustain 4 uops per clock from the uop cache under ideal conditions, even with instructions that fully pack uop-cache lines with 6 uops per line. So there's apparently some buffering between uop-cache-line fetch and adding to the IDQ; otherwise delivery would follow a 4 : 2 pattern if all the uops added to the IDQ in a given cycle had to come from the same line.
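
    As for question 3: gem5's classic caches let you configure the L1i geometry directly, but AFAIK gem5 does not model a uop cache at all, so simulating DSB effects would mean modifying the CPU model itself (e.g. the fetch/decode stages of the O3 CPU, in C++) rather than tweaking a cache object. A minimal sketch of the part gem5 does support, assuming a recent gem5 with the standard m5.objects API:

    ```python
    # Sketch: configurable L1 instruction-cache parameters in a gem5
    # classic-cache config script. This covers the L1i only; there is
    # no DSB/uop-cache object to configure in stock gem5.
    from m5.objects import Cache

    class L1ICache(Cache):
        size = '32kB'          # capacity
        assoc = 8              # associativity
        tag_latency = 1        # latencies in cycles
        data_latency = 1
        response_latency = 1
        mshrs = 4              # outstanding-miss tracking
        tgts_per_mshr = 20
    ```

    Sweeping size and assoc here and rerunning the simulation gives the L1i experiments the question asks about; a DSB experiment would need analogous parameters added to the decode stage of the CPU model.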