Tags: caching, x86, cpu-architecture, cpu-cache, mesi

What role does L3$ play in the MESI protocol?


I would like to know more details about MESI on Intel Broadwell.

Suppose a CPU socket has 6 cores, core 0 to core 5. Each core has its own L1$ and L2$, and they all share L3$. There is a variable x in shared memory, located in a cache line called XCacheL. The details of my question follow:

T1: Core 0, core 4, and core 5 all have x = 100, and XCacheL is in Shared state since 3 cores have a copy of it.

T2: Core 0 needs to modify x, so core 0 broadcasts an invalidate signal. Core 4 and core 5 receive the signal and invalidate their copies of XCacheL. Core 0 modifies x to 200, and its copy of XCacheL is now in Modified state.

T3: Core 4 needs to read x, but its copy of XCacheL was invalidated in T2, so it triggers a read miss. The following then happens:

● Processor makes bus request to memory
● Snooping cache puts copy value on the bus
● Memory access is abandoned
● Local processor caches value
● Local copy tagged S
● Source (M) value copied back to memory
● Source value M -> S

So after T3, XCacheL is Shared in core 0 and core 4 and Invalid in core 5, and both L3$ and main memory have the newest valid XCacheL.
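The T1 to T3 sequence can be replayed in a toy model of textbook bus-snooping MESI. This is only a sketch of the simplified protocol in the list above, not of how Broadwell actually implements coherence. `ToyMesiLine`, `read`, and `write` are hypothetical names introduced for illustration, and the `memory` field is an assumption that stands in for whatever sits behind the private caches (L3$ and DRAM).

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


class ToyMesiLine:
    """One cache line tracked across n private caches (textbook snooping model)."""

    def __init__(self, n_cores, memory_value):
        self.state = [State.I] * n_cores
        self.value = [None] * n_cores
        # Assumption: a single backing store standing in for L3$ / main memory.
        self.memory = memory_value

    def read(self, core):
        if self.state[core] is not State.I:
            return self.value[core]  # hit in the private cache
        # Read miss: every other cache snoops the request.
        for other, st in enumerate(self.state):
            if other == core:
                continue
            if st is State.M:
                self.memory = self.value[other]  # dirty copy is written back
            if st in (State.M, State.E):
                self.state[other] = State.S      # M/E -> S
        sharers = any(
            st is not State.I for i, st in enumerate(self.state) if i != core
        )
        self.value[core] = self.memory
        self.state[core] = State.S if sharers else State.E
        return self.value[core]

    def write(self, core, new_value):
        # Invalidate all other copies, then modify the local copy.
        for other in range(len(self.state)):
            if other == core:
                continue
            if self.state[other] is State.M:
                self.memory = self.value[other]
            self.state[other] = State.I
            self.value[other] = None
        self.value[core] = new_value
        self.state[core] = State.M


line = ToyMesiLine(n_cores=6, memory_value=100)
for c in (0, 4, 5):          # T1: cores 0, 4, 5 load x
    line.read(c)
line.write(0, 200)           # T2: core 0 stores x = 200
t3_value = line.read(4)      # T3: core 4 reloads x after its copy was invalidated
```

After these calls the model ends in the state described for T3: cores 0 and 4 hold the line as Shared, core 5 holds it as Invalid, and the backing store has the value 200.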

T4: Core 5 needs to read x. Its copy of XCacheL was invalidated in T2, but at this moment L3$ has a correct copy of XCacheL. Would core 5 need to trigger a read miss like core 4 did?

My guess is that there is no need: since L3$ has a valid XCacheL, core 5 can reach L3$ and bring the correct XCacheL from L3$ into its own L1$, so core 5 won't trigger a read miss.


Solution

  • You're right, in your T4 step, core #5's load will hit in L3, so no memory access happens. Core #5 gets another copy of the line, in Shared state.


    Your sequence of steps makes zero sense for a CPU like Broadwell where all cores share access to on-chip DRAM controller(s).

    A ring bus connects cores (each of which has a slice of L3 cache) and the System Agent (PCIe links and connection to other cores) and Home Agent (memory controllers). See https://en.wikichip.org/wiki/intel/microarchitectures/broadwell_(client)#Die_Stats for a block diagram showing the ring bus.

    Individual cores don't directly drive "the memory bus", or even one of the 2 or 4 DRAM buses. The memory controller arbitrates access to DRAM, and has some buffering to reorder / combine accesses. (Everything that accesses memory goes through it, including DMA, so it can do whatever it likes as long as it gives the appearance of loads/stores happening in some sane order.)

    A load request won't be sent to the system agent until after it misses in L3 cache. See https://superuser.com/questions/1226197/x86-address-space-controller/1226198#1226198 for an illustration of a quad-core desktop (which is simpler and just has the memory controller connected to the System Agent, making it exactly like a Northbridge before CPUs integrated the memory controllers.)


    Since Broadwell uses an inclusive L3 cache, L3 tags can tell it which, if any, core has a Modified or Exclusive copy, even if the line in L3 itself isn't shareable. (i.e. a line's data can be Invalid in L3, but the tags are still tracking which core has a private copy). See "Which cache mapping technique is used in intel core i7 processor?"

    This lets L3 tags act as a snoop filter to reduce broadcasts.
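    The snoop-filter idea can be sketched as a small directory lookup. This is a toy model under stated assumptions, not Broadwell's real tag format: `ToyInclusiveL3` and its methods are hypothetical names, and the per-line set of core ids is only an illustrative stand-in for whatever tracking bits the hardware keeps in its L3 tags.

```python
class ToyInclusiveL3:
    """Toy inclusive L3 tag array: per line, which cores may hold a private copy."""

    def __init__(self, n_cores):
        self.n_cores = n_cores
        # Hypothetical layout: line address -> set of core ids with a private copy.
        self.holders = {}

    def record_fill(self, addr, core):
        # A core brought the line into its private L1/L2.
        self.holders.setdefault(addr, set()).add(core)

    def record_invalidate(self, addr, core):
        # A core dropped or invalidated its private copy.
        self.holders.get(addr, set()).discard(core)

    def cores_to_snoop(self, addr, requester):
        if addr not in self.holders:
            # Inclusive L3: no tag entry means no private cache can hold the line.
            return set()
        return self.holders[addr] - {requester}

    def broadcast_targets(self, requester):
        # What a filterless design would have to snoop: every other core.
        return set(range(self.n_cores)) - {requester}


l3 = ToyInclusiveL3(n_cores=6)
l3.record_fill(0x40, 0)                   # core 0 owns the line privately
filtered = l3.cores_to_snoop(0x40, 5)     # only core 0 needs to be snooped
untracked = l3.cores_to_snoop(0x80, 5)    # no tag entry, nobody to snoop
unfiltered = l3.broadcast_targets(5)      # all five other cores
```

    With the tags acting as a filter, core 5's request for line 0x40 is forwarded to one core instead of five, and a request for an untracked line needs no snoops at all.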