Tags: performance, intel, cpu-architecture, memory-access

Multiple accesses to main memory and out-of-order execution


Let us assume that I have two pointers that are pointing to unrelated addresses that are not cached, so they will both have to come all the way from main memory when being dereferenced.

int load_and_add(int *pA, int *pB)
{
    int a = *pA;   // will most likely miss in cache
    int b = *pB;   // will most likely miss in cache 

    // ...  some code that does not use a or b

    int c = a + b;
    return c;
}

If out-of-order execution allows executing the code before the value of c is computed, how will the fetching of values a and b proceed on a modern Intel processor?

Are the potentially-pipelined memory accesses completely serialized or may there be some sort of fetch overlapping performed by the CPU's memory controller?

In other words, if we assume that hitting main memory costs 300 cycles, will fetching a and b cost 600 cycles, or does out-of-order execution enable some overlap so that it costs fewer cycles?


Solution

  • Modern CPUs have multiple load buffers, so multiple loads can be outstanding at the same time. The memory subsystem is heavily pipelined, giving many parts of it much better throughput than latency. For example, with prefetching, Haswell can sustain an 8-byte load from main memory every clock cycle, but the latency when the address isn't known ahead of time is in the hundreds of cycles.

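    So yes, the two misses can overlap: a Haswell core can keep track of up to 72 outstanding load uops waiting for data from cache / memory. Since the loads of a and b are independent, both can be in flight at once, and fetching them should cost roughly one main-memory latency (about 300 cycles under the question's assumption) rather than 600. (The 72-entry limit is per-core. The shared L3 cache also needs some buffers to handle the whole system's loads / stores to DRAM and memory-mapped IO.)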

    Haswell's ReOrder Buffer size is 192 uops, so up to 190 uops of work in the code that does not use a or b can be issued and executed while the loads of a and b are the oldest instructions that haven't retired. Instructions / uops are retired in-order to support precise exceptions. The ROB size is basically the limit of the out-of-order window for hiding latency of slow operations like cache-misses.

    Also see other links at the tag wiki to learn how CPUs work. Agner Fog's microarch guide is great for having a mental model of the CPU pipeline to let you understand approximately how code will execute.

    From David Kanter's Haswell writeup: [figure: Intel Haswell, from David Kanter's article]
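The overlap described above can be observed with a rough pointer-chasing experiment. The sketch below is not from the original answer; all names (`xorshift64`, `build_cycle`, `chase_one`, `chase_two`, `mlp_demo`) are hypothetical helpers invented for illustration. It assumes a table much larger than the last-level cache, laid out as one random cycle so the prefetchers cannot guess the next address. It then issues the same number of loads twice: once as a single dependent chain (each miss must finish before the next can start), and once as two independent chains (two misses can be in flight together).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Small xorshift PRNG (hypothetical helper); the state must be nonzero. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Fill next[] with a random permutation that forms one single cycle
   (Sattolo's algorithm), so the access order is hard to predict. */
static void build_cycle(size_t *next, size_t n, uint64_t seed)
{
    assert(n > 1 && seed != 0);
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)(xorshift64(&seed) % i);   /* 0 <= j < i */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* One dependent chain: each load's address comes from the previous load,
   so consecutive misses cannot overlap. */
static size_t chase_one(const size_t *next, size_t start, size_t steps)
{
    size_t p = start;
    for (size_t i = 0; i < steps; i++)
        p = next[p];
    return p;
}

/* Two independent chains: the load for q does not depend on the load for p,
   so an out-of-order core may keep both misses in flight at once. */
static size_t chase_two(const size_t *next, size_t a, size_t b, size_t steps)
{
    size_t p = a, q = b;
    for (size_t i = 0; i < steps; i++) {
        p = next[p];
        q = next[q];
    }
    return p ^ q;   /* combine both results so neither chain is dead code */
}

/* Time 2*steps loads as one chain, then as two chains of steps loads each.
   Assumes steps <= n / 2. Returns 0 on success, -1 if allocation fails. */
int mlp_demo(size_t n, size_t steps)
{
    size_t *next = malloc(n * sizeof *next);
    if (next == NULL)
        return -1;
    build_cycle(next, n, 0x9E3779B97F4A7C15ull);

    /* Start the second chain half a cycle away so the two chains
       visit disjoint slots. */
    size_t far = chase_one(next, 0, n / 2);

    clock_t t0 = clock();
    size_t r1 = chase_one(next, 0, 2 * steps);
    clock_t t1 = clock();
    size_t r2 = chase_two(next, 0, far, steps);
    clock_t t2 = clock();

    double loads = 2.0 * (double)steps;
    printf("one chain:  %.2f ns per load\n",
           1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / loads);
    printf("two chains: %.2f ns per load\n",
           1e9 * (double)(t2 - t1) / CLOCKS_PER_SEC / loads);
    printf("(checksum %zu)\n", r1 ^ r2);   /* keeps the results live */

    free(next);
    return 0;
}
```

To try it, call something like `mlp_demo((size_t)1 << 24, (size_t)1 << 22)` from `main`; with an 8-byte `size_t` that is a 128 MiB table. If the misses overlap as described, the two-chain figure should come out noticeably lower than the one-chain figure, but the exact ratio depends on the machine, TLB behavior, and compiler flags, so treat the numbers as indicative only.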