
Is port blocked when data is fetching from cache or memory in CPU microarchitecture?


Intel Skylake cores have two identical memory read ports (port 2 and port 3) and one write port (port 4). Assume two load instructions are issued to ports 2 and 3 in parallel:

  1. When both loads can be served from L1 cache (about 10 ns?), will ports 2 and 3 be blocked until the data is fetched and the load instructions retire?

  2. What if the data is not in cache and must be fetched from memory? Will the load ports be blocked for a long time?

  3. Another guess: while data is in flight from cache or memory, the request is tracked in the load buffer in the MOB and the port is released for the next load. That would mean a port can serve multiple loads simultaneously while their data is on the way from cache/memory to the core?

Supporting material would be much appreciated. I googled but found no answer.


Solution

  • The load execution units are fully pipelined, sustaining 2 loads per clock on cache hits. See https://agner.org/optimize/ and https://uops.info/, which include experimental test results verifying sustained execution of 2 load uops per clock.

    Try it yourself with a loop like this in a static executable (e.g. assembled with NASM and linked with ld). Run it under perf stat ./a.out and note that the loop runs at 1 cycle per iteration (2 loads).

     mov rdi, rsp          ; rsp points at valid, already-cached stack memory
     mov edx, 1000000000   ; 1G iterations
    .loop:
     mov eax, [rdi]        ; two independent loads per iteration
     mov ecx, [rdi+4]
     dec edx
     jnz .loop
    
     xor edi, edi          ; exit status 0
     mov eax, 231          ; __NR_exit_group
     syscall               ; Linux exit_group(edi)
    

    Also see Intel's optimization manual, where you can see that Skylake's sustained L1d bandwidth is over 80 bytes per cycle (2 loads and 1 store, of 32-byte vectors). Apparently something sometimes prevents sustaining the full 2 loads + 1 store per clock, at least with vectors that wide, but it definitely doesn't stall.

    An L1d cache miss doesn't stall the port either; load uops can keep executing until you run out of line-fill buffers (LFBs) and stall. But even with all the LFBs occupied waiting for incoming cache lines, loads that hit in L1d cache can still execute, and a load to the same cache line as another outstanding miss can piggyback on that miss's LFB. (Or you might run out of load buffers instead, which would stop the allocate/rename stage from issuing more load uops into the back end.)

    Also, L1d cache-hit latency is 5 cycles on modern Intel; that's just over 1 ns, not 10 ns! https://www.7-cpu.com/cpu/Skylake.html

    See also https://www.realworldtech.com/haswell-cpu/.

    Also https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ re: cache misses eventually stalling.