Related to Understanding `_mm_prefetch`.
I understood that `_mm_prefetch()` causes the requested value to be fetched into the processor's cache, and that my code keeps executing while the prefetch is in flight.
However, my VS2017 profiler reports that 5.7% of the time is spent on the line that accesses my cache and 8.63% on the `_mm_prefetch` line. Is the profiler mistaken? If I have to wait for the data to be fetched anyway, what is the prefetch for? I could just as well wait in the next function call, when I actually need the data.
On the other hand, the overall timing shows significant benefit of that prefetch call.
So the question is: is the data being fetched asynchronously?
Additional information.
I have multiple caches for various key widths, up to 32-bit keys (which I am currently profiling). The cache access and the prefetch are extracted into separate `__declspec(noinline)` functions to isolate them from the surrounding code.
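    #include <cstdint>
    #include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0

    uint8_t* cache[33];

    // Isolated load so the profiler attributes its cost separately.
    __declspec(noinline)
    uint8_t get_cached(uint8_t* address) {
        return *address;
    }

    // Isolated prefetch so the profiler attributes its cost separately.
    __declspec(noinline)
    void prefetch(uint8_t* pcache) {
        _mm_prefetch((const char*)pcache, _MM_HINT_T0);
    }

    int foo(const uint64_t seq64) {
        uint64_t key = seq64 & 0xFFFFFFFF;
        uint8_t* pcache = cache[32];
        int x = get_cached(pcache + key);   // load for the current key
        key = (key * 2) & 0xFFFFFFFF;
        pcache += key;
        prefetch(pcache);                   // prefetch for the next key
        // code that uses x
    }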
The profiler shows 7.22% for the `int x = get_cached(pcache + key);` line and 8.97% for the `prefetch(pcache);` line, while the surrounding code shows 0.40-0.45% per line.
Substantially everything on an out-of-order CPU is "asynchronous" in the way you describe (really, running in parallel and out of order). In that sense, prefetch isn't really different from regular loads, which can also run out of order or "async" with respect to other instructions.
Once that is understood, the exact behavior of prefetch is hardware dependent, but it is my observation that:
- On Intel, prefetch instructions can retire before their data arrives, so a prefetch instruction that successfully begins execution won't block the CPU pipeline after that. However, note "successfully begins execution": the prefetch instruction still requires a line fill buffer (MSHR) if it misses in L1, and on Intel it will wait for that resource if none is available. So if you issue a lot of prefetch misses in parallel, they end up waiting for fill buffers, which makes them behave much like ordinary loads in that scenario.
- On AMD Zen [2], prefetches do not wait for a fill buffer if none is available. Presumably, the prefetch is simply dropped. So a large number of prefetch misses behaves quite differently than on Intel: they complete very quickly, regardless of whether they miss, but many of the associated lines will not actually be fetched.
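One practical consequence of the fill-buffer behavior is that it can help to bound how many prefetches are outstanding at once. The following is only a minimal sketch, not the asker's code: `sum_lookups`, the `PREFETCH_T0` macro, and the `distance` parameter are hypothetical names introduced for illustration. It prefetches the entry needed `distance` iterations ahead, so roughly `distance` prefetches are in flight at any time. The right value is hardware dependent and has to be found by measurement.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0
#define PREFETCH_T0(p) _mm_prefetch(reinterpret_cast<const char*>(p), _MM_HINT_T0)
#else
#define PREFETCH_T0(p) ((void)0)  // assumption: treat prefetch as a no-op off x86
#endif

// Hypothetical helper: sums table[keys[i]] while prefetching the entry that
// will be needed `distance` iterations later. A prefetch never changes the
// result; it only affects when the cache line is requested.
std::uint64_t sum_lookups(const std::vector<std::uint8_t>& table,
                          const std::vector<std::uint32_t>& keys,
                          std::size_t distance) {
    std::uint64_t sum = 0;
    const std::size_t n = keys.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (i + distance < n) {
            assert(keys[i + distance] < table.size());
            PREFETCH_T0(&table[keys[i + distance]]);  // request a future line
        }
        assert(keys[i] < table.size());
        sum += table[keys[i]];  // consume a line requested earlier
    }
    return sum;
}
```

Calling `sum_lookups(table, keys, 8)` and `sum_lookups(table, keys, 0)` should return the same sum; only the timing differs. Comparing wall-clock time across a few `distance` values on the target machine is a more reliable guide than per-line profiler percentages.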