I am benchmarking the following copy
function (not fancy!) with a high size
argument (~1GB):
void copy(unsigned char* dst, unsigned char* src, int count)
{
    for (int i = 0; i < count; ++i)
    {
        dst[i] = src[i];
    }
}
I built this code with GCC 6.2, with -O3 -march=native -mtune=native, on a Xeon E5-2697 v2.
So you can see the assembly GCC generated on my machine, here is the inner loop:
movzx ecx, byte ptr [rsi+rax*1]
mov byte ptr [rdi+rax*1], cl
add rax, 0x1
cmp rdx, rax
jnz 0xffffffffffffffea
Now, as my LLC is ~25MB and I am copying ~1GB, it makes sense that this code is memory-bound. perf confirms this with a high number of stalled frontend cycles:
6914888857 cycles # 2,994 GHz
4846745064 stalled-cycles-frontend # 70,09% frontend cycles idle
<not supported> stalled-cycles-backend
8025064266 instructions # 1,16 insns per cycle
# 0,60 stalled cycles per insn
My first question is about the 0.60 stalled cycles per instruction. This seems like a very low number for code that accesses the LLC/DRAM all the time, since the data is not cached. Given that LLC latency is ~30 cycles and main memory is around 100 cycles, how is this achieved?
My second question is related: it seems that the prefetcher is doing a relatively good job (not surprising, since it's an array, but still): we hit the LLC instead of DRAM 60% of the time. Still, why does it fail the other times? Which bandwidth/part of the uncore makes this prefetcher fail to accomplish its task?
83788617 LLC-loads [50,03%]
50635539 LLC-load-misses # 60,43% of all LL-cache hits [50,04%]
27288251 LLC-prefetches [49,99%]
24735951 LLC-prefetch-misses [49,97%]
Last, but not least: I know that Intel can pipeline instructions; is that also the case for such mov instructions with memory operands?
Thank you very much!
TL;DR: There are a total of 5 uops in the unfused domain (see: Micro fusion and addressing modes). The loop stream detector on Ivy Bridge cannot allocate uops across loop-body boundaries (see: Is performance reduced when executing loops whose uop count is not a multiple of processor width?), so it takes two cycles to allocate one iteration. The loop actually runs at 2.3c/iter on a dual-socket Xeon E5-2680 v2 (10 cores per socket vs. your 12), so that is close to the best that can be done given the front-end bottleneck.
The prefetchers have performed very well, and most of the time the loop is not memory-bound. Copying 1 byte per 2 cycles is very slow. (gcc did a poor job and should have given you a loop that could run at 1 iteration per clock. Without profile-guided optimization, even -O3 doesn't enable -funroll-loops, but there are tricks it could have used (like counting a negative index up toward zero, or indexing the load relative to the store and incrementing a destination pointer) that would have brought the loop down to 4 uops.)
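For illustration, here is a rough sketch (mine, not anything gcc actually emitted) of the second trick: address the load relative to the store pointer, keep the store's addressing mode simple so it stays micro-fused, and let cmp/jnz macro-fuse. The register roles (rdi = dst, rdx = src - dst, r8 = dst + count) are assumptions for this sketch:

.copy_loop:
    movzx ecx, byte [rdi+rdx]   ; load the source byte (rdi+rdx walks src)  -> 1 uop
    mov byte [rdi], cl          ; simple-address store stays micro-fused    -> 1 fused-domain uop
    inc rdi                     ; advance the destination pointer           -> 1 uop
    cmp rdi, r8                 ; loop until dst reaches dst + count
    jnz .copy_loop              ; cmp+jnz macro-fuse                        -> 1 uop

That is 4 uops in the fused domain, which a 4-wide front end could allocate at 1 iteration per clock.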
The extra 0.3 cycles per iteration beyond the front-end bottleneck is, on average, probably coming from stalls when prefetching fails (maybe at page boundaries), or maybe from page faults and TLB misses in this test, which runs over statically-initialized memory in the .data section.
There are two data dependencies in the loop. First, the store instruction (specifically the STD uop) depends on the result of the load instruction. Second, both the store and the load instructions depend on add rax, 0x1. In fact, add rax, 0x1 depends on itself as well. Since the latency of add rax, 0x1 is one cycle, an upper bound on the performance of the loop is 1 cycle per iteration.
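To make those dependencies explicit, here is the same inner loop again with each instruction's inputs noted in a comment (the annotations are mine):

.loop:
    movzx ecx, byte [rsi+rax*1]   ; load address depends on rax (produced by add in the previous iteration)
    mov byte [rdi+rax*1], cl      ; STA depends on rax; STD depends on cl, i.e. on the load result
    add rax, 0x1                  ; depends on its own previous value: the 1-cycle loop-carried chain
    cmp rdx, rax                  ; depends on the new rax
    jnz .loop                     ; consumes the flags from cmp (macro-fused with it)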
Since the store (STD) depends on the load, it cannot be dispatched from the RS until the load completes, which takes at least 4 cycles (in case of an L1 hit). In addition, there is only one port that can accept STD uops, yet up to two loads can complete per cycle on Ivy Bridge (in particular when the two loads are to lines that are resident in the L1 cache and no bank conflict occurs), resulting in additional contention. However, RESOURCE_STALLS.ANY shows that the RS actually never gets full. IDQ_UOPS_NOT_DELIVERED.CORE counts the number of issue slots that were not utilized; this is equal to 36% of all slots. The LSD.CYCLES_ACTIVE event shows that the LSD is used to deliver uops most of the time. However, LSD.CYCLES_4_UOPS/LSD.CYCLES_ACTIVE =~ 50% shows that in about 50% of the cycles, fewer than 4 uops are delivered to the RS. The RS will not get full because of the sub-optimal allocation throughput.
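To connect that ratio to the uop count: assuming, as in the TL;DR, that one iteration is 5 uops by the time it reaches the 4-wide allocator (the indexed store un-laminates) and that the LSD cannot split an iteration across allocation groups, allocation alternates between a 4-uop cycle and a 1-uop cycle:

$$\lceil 5/4 \rceil = 2 \text{ cycles per iteration}, \qquad \frac{\texttt{LSD.CYCLES\_4\_UOPS}}{\texttt{LSD.CYCLES\_ACTIVE}} \approx \frac{1}{2}$$

which is consistent with both the measured ratio and the measured ~2.3c/iter.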
The stalled-cycles-frontend count corresponds to UOPS_ISSUED.STALL_CYCLES, which counts allocation stalls due to both frontend stalls and backend stalls. I don't understand how UOPS_ISSUED.STALL_CYCLES is related to the number of cycles and the other events.
The LLC-loads count includes demand load requests that reach the L3, whether they hit or miss there. LLC-load-misses is a subset of LLC-loads and includes only those requests that missed in the L3. Both are counted per core.
There is an important difference between counting requests (cache-line granularity) and counting load instructions or load uops (using MEM_LOAD_UOPS_RETIRED.*). Both the L1 and L2 caches squash load requests to the same cache line, so multiple misses in the L1 may result in a single request to the L3.
Optimal performance can be achieved if all stores and loads hit in the L1 cache. Since the size of the buffer you used is 1GB, the loop can cause a maximum of 1GB/64 =~ 17M L3 demand load requests. However, your LLC-loads measurement, 83M, is much larger, probably due to code other than the loop you've shown in the question. Another possible reason is that you forgot to use the :u suffix to count only user-mode events.
My measurements on both IvB and HSW show that LLC-loads:u is negligible compared to 17M. However, most of the L3 loads are misses (i.e., LLC-loads:u =~ LLC-load-misses:u). CYCLE_ACTIVITY.STALLS_LDM_PENDING shows that the overall impact of loads on performance is negligible. In addition, my measurements show that the loop runs at 2.3c/iter on IvB (vs. 1.5c/iter on HSW), which suggests that one load is issued every 2 cycles. I think the sub-optimal allocation throughput is the main reason for this. Note that 4K aliasing conditions (LD_BLOCKS_PARTIAL.ADDRESS_ALIAS) are almost non-existent. All of this means that the prefetchers have done a pretty good job of hiding the memory access latency for most loads.
Several counters on IvB can be used to evaluate the performance of the hardware prefetchers.
Your processor has two L1 data prefetchers and two L2 data prefetchers (one of which can prefetch into the L2 and/or the L3). A prefetcher may not be effective for several reasons: for example, it cannot prefetch across a 4KB page boundary, it needs a few accesses to detect a stream before it starts prefetching, and it can be throttled when demand traffic is heavy and competes for the same buffers and bandwidth.
The number of demand misses at the L1, L2, and L3 is a good indicator of how well the prefetchers have performed. All L3 misses (as counted by LLC-load-misses) are necessarily also L2 misses, so the number of L2 misses is larger than LLC-load-misses. Also, all demand L2 misses are necessarily L1 misses.
On Ivy Bridge, you can use the LOAD_HIT_PRE.HW_PF and CYCLE_ACTIVITY.CYCLES_* performance events (in addition to the miss events) to learn more about how the prefetchers have performed and to evaluate their impact on performance. It's important to measure the CYCLE_ACTIVITY.CYCLES_* events because even if the miss counts are seemingly high, that doesn't necessarily mean that misses are the main cause of the performance degradation.
Note that the L1 prefetchers cannot issue speculative RFO requests. Therefore, most writes that reach the L1 will actually miss, requiring the allocation of an LFB per cache line at the L1 and potentially at other levels.
The code I used follows.
BITS 64
DEFAULT REL

; COUNT must be defined at assembly time, e.g. nasm -felf64 -DCOUNT=1073741824

section .data
bufdest: times COUNT db 1
bufsrc:  times COUNT db 1

section .text
global _start

_start:
    lea rdi, [bufdest]
    lea rsi, [bufsrc]
    mov rdx, COUNT
    mov rax, 0

.loop:
    movzx ecx, byte [rsi+rax*1]   ; same inner loop as the gcc-generated code
    mov byte [rdi+rax*1], cl
    add rax, 1
    cmp rdx, rax
    jnz .loop

    ; exit_group(0)
    xor edi, edi
    mov eax, 231                  ; __NR_exit_group on x86-64 Linux
    syscall