Imagine a load-store loop like the following which loads DWORD
s from non-contiguous locations and stores them contiguously:
top:
mov eax, DWORD [rsi]
mov DWORD [rdi], eax
mov eax, DWORD [rdx]
mov DWORD [rdi + 4], eax
; unroll the above a few times
; increment rdi and rsi somehow
cmp ...
jne top
On modern Intel and AMD hardware, when running in-cache such a loop will usually bottleneck ones stores at one store per cycle. That's kind of wasteful, since that's only an IPC of 2 (one store, one load).
One idea that naturally arises is to combine two DWORD
loads into a single QWORD
store which is possible since the stores are contiguous. Something like this could work:
top:
mov eax, DWORD [rsi]
mov ebx, DWORD [rdx]
shl rbx, 32
or rax, rbx
mov QWORD [rdi]
Basically do the two loads and use two ALU ops to combine them into a single QWORD
which we can store with a single store. Now we're bottlenecked on uops: 5 uops per 2 DWORD
s - so 1.25 cycles per QWORD
or 0.625 cycles per DWORD
.
Already much better than the first option, but I can't help but think there is a better option for this shuffling - for example, we are wasting uop throughput by using plain loads - It feels like we should be able to combine at least some of the ALU ops with the loads with memory source operands, but I was mostly stymied on Intel: shl
on memory only has a RMW form, and shlx
and rolx
don't micro-fuse.
It also seems like we could maybe get the shift for free by making the second load a QWORD
load offset by -4
, but then we are left clearing out garbage in the load DWORD
.
I'm interested in scalar code, and code for both the base x86-64 instruction set and better versions if possible with useful extensions like BMI
.
It also seems like we could maybe get the shift for free by making the second load a QWORD load offset by -4, but then we are left clearing out garbage in the load DWORD.
If wider loads are ok for correctness and performance (cache-line splits...), we can use shld
top:
mov eax, DWORD [rsi]
mov rbx, QWORD [rdx-4] ; unaligned(?) 64-bit load
shld rax, rbx, 32 ; 1 uop on Intel SnB-family, 0.5c recip throughput
mov QWORD [rdi], rax
MMX punpckldq mm0, [mem]
micro-fuses on SnB-family (including Skylake).
top:
movd mm0, DWORD [rsi]
punpckldq mm0, QWORD [rdx] ; 1 micro-fused uop on Intel SnB-family
movq QWORD [rdi], mm0
; required after the loop, making it only worth-while for long-running loops
emms
punpckl instructions unfortunately have a vector-width memory operand, not half-width. This often spoils them for uses where they'd otherwise be perfect (especially the SSE2 version where the 16B memory operand must be aligned). But note that the MMX versions (with only a qword memory operand) don't have an alignment requirement.
You could also use the 128-bit AVX version, but that's even more likely to cross a cache line boundary and be slow. (Skylake does not optimize by loading only the required 8 bytes; a loop with an aligned mov
+ vpunckldq xmm1, xmm0, [cache_line-8]
runs at 1 iter per 2 clocks vs. 1 iter per clock for aligned.) The AVX version is required to fault if the 16-byte load crosses into an unmapped page, so it couldn't just use a narrower load without extra support from the load port. :/
Such a frustrating and useless design decision (presumably made before load ports could zero-extend for free, and not fixed with AVX). At least we have movhps
as a replacement for memory-source punpcklqdq
, but narrower widths that actually shuffle can't be replaced.
To avoid CL-splits, you could also use a separate movd
load and punpckldq
, or SSE4.1 pinsrd
. With this, there's no reason for MMX.
top:
movd xmm0, DWORD [rsi]
movd xmm1, DWORD [rdx] ; SSE2
punpckldq xmm0, xmm1
; or pinsrd xmm0, DWORD [rdx], 1 ; 2 uops not micro-fused
movq QWORD [rdi], xmm0
Obviously AVX2 vpgatherdd
is a possibility, and may perform well on Skylake.