I was inspired by this link https://www.sigarch.org/simd-instructions-considered-harmful/ to look into how AVX512 performs. My idea was that the clean up loop after the loop could be removed using the AVX512 mask operations.
Here is the code I am using
void daxpy2(int n, double a, const double x[], double y[]) {
__m512d av = _mm512_set1_pd(a);
int r = n&7, n2 = n - r;
for(int i=-n2; i<0; i+=8) {
__m512d yv = _mm512_loadu_pd(&y[i+n2]);
__m512d xv = _mm512_loadu_pd(&x[i+n2]);
yv = _mm512_fmadd_pd(av, xv, yv);
_mm512_storeu_pd(&y[i+n2], yv);
}
__m512d yv = _mm512_loadu_pd(&y[n2]);
__m512d xv = _mm512_loadu_pd(&x[n2]);
yv = _mm512_fmadd_pd(av, xv, yv);
__mmask8 mask = (1 << r) -1;
//__mmask8 mask = _bextr_u32(-1, 0, r);
_mm512_mask_storeu_pd(&y[n2], mask, yv);
}
I thought using BMI1 and/or BMI2 instructions could generate masks with fewer instructions. However,
__mmask8 mask = _bextr_u32(-1, 0, r)
is no better (in number of instructions) than
__mmask8 mask = (1 << r) -1;
see https://godbolt.org/z/BFQCM3 and https://godbolt.org/z/tesmB_.
This appears to be due to the fact that _bextr_u32 does a shift by 8 anyway.
Can the mask be generated with fewer instructions (e.g. with BMI or another method) or more optimally?
I have augmented the table in the link with my results for AVX512.
ISA | MIPS-32 | AVX2 | RV32V | AVX512 |
******************************|*********|****** |*******|******* |
Instructions(static) | 22 | 29 | 13 | 28 |
Instructions per Main Loop | 7 | 6* | 10 | 5*|
Bookkeeping Instructions | 15 | 23 | 3 | 23 |
Results per Main Loop | 2 | 4 | 64 | 8 |
Instructions (dynamic n=1000) | 3511 | 1517**| 163 | 645 |
*macro-op fusion will reduce the number of uops in the main loop by 1
** without the unnecessary cmp instructions it would only be 1250+ instructions.
I think if the authors of the link had counted from -n
up to 0
instead of from 0
to n
they could have skipped the cmp
instruction as I have (see the assembly below) in the main loop so for AVX it chould have been 5 instructions in the main loop.
Here is the assembly with ICC19 and -O3 -xCOMMON-AVX512
daxpy2(int, double, double const*, double*):
mov eax, edi #6.13
and eax, 7 #6.13
movsxd r9, edi #6.25
sub r9, rax #6.21
mov ecx, r9d #7.14
neg ecx #7.14
movsxd rcx, ecx #7.14
vbroadcastsd zmm16, xmm0 #5.16
lea rdi, QWORD PTR [rsi+r9*8] #9.35
lea r8, QWORD PTR [rdx+r9*8] #8.35
test rcx, rcx #7.20
jge ..B1.5 # Prob 36% #7.20
..B1.3: # Preds ..B1.1 ..B1.3
vmovups zmm17, ZMMWORD PTR [rdi+rcx*8] #10.10
vfmadd213pd zmm17, zmm16, ZMMWORD PTR [r8+rcx*8] #10.10
vmovups ZMMWORD PTR [r8+rcx*8], zmm17 #11.23
add rcx, 8 #7.23
js ..B1.3 # Prob 82% #7.20
..B1.5: # Preds ..B1.3 ..B1.1
vmovups zmm17, ZMMWORD PTR [rsi+r9*8] #15.8
vfmadd213pd zmm16, zmm17, ZMMWORD PTR [rdx+r9*8] #15.8
mov edx, -1 #17.19
shl eax, 8 #17.19
bextr eax, edx, eax #17.19
kmovw k1, eax #18.3
vmovupd ZMMWORD PTR [r8]{k1}, zmm16 #18.3
vzeroupper #19.1
ret #19.1
where
add r8, 8
js ..B1.3
should macro-op fuse to one instruction. However, as pointed out by Peter Cordes in this answer js cannot fuse. The compiler could have generated jl
instead which would have fused.
I used Agner Fog's testp utility to get the core clocks (not reference clocks), instructions, uops retired. I did this for SSE2 (actually AVX2 with FMA but with 128-bit vectors), AVX2, and AVX512 for three different variations of looping
v1 = for(int64_t i=0; i<n; i+=vec_size) // generates cmp instruction
v2 = for(int64_t i=-n2; i<0; i+=vec_size) // no cmp but uses js
v3 = for(int64_t i=-n2; i!=0; i+=vec_size) // no cmp and uses jne
vec_size = 2 for SSE, 4 for AVX2, and 8 for AVX512
vec_size version core cycle instructions uops
2 v1 895 3014 3524
2 v2 900 2518 3535
2 v3 870 2518 3035
4 v1 527 1513 1777
4 v2 520 1270 1777
4 v3 517 1270 1541
8 v1 285 765 910
8 v2 285 645 910
8 v3 285 645 790
Notice that core clocks is not really a function of the loop version. It only depends on iterations of the loop. It is proportional to 2*n/vec_size
.
SSE 2*1000/2=1000
AVX2 2*1000/4=500
AVX512 2*1000/8=250
The number of instructions does change from v1 to v2 but not between v2 and v3. For v1 it is proportional to 6*n/vec_size
and for v2 and v3 5*n/vec_size
Finally, the number of uops is more or less the same for v1 and v2 but drops for v3. For v1 and v2 it is proportional to 7*n/vec_size
and for v3 6*n/vec_size
.
Here is the result with IACA3 for vec_size=2
Throughput Analysis Report
--------------------------
Block Throughput: 1.49 Cycles Throughput Bottleneck: FrontEnd
Loop Count: 50
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
--------------------------------------------------------------------------------------------------
| Cycles | 0.5 0.0 | 0.5 | 1.5 1.0 | 1.5 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | vmovupd xmm1, xmmword ptr [r8+rax*8]
| 2 | 0.5 | 0.5 | 0.5 0.5 | 0.5 0.5 | | | | | vfmadd213pd xmm1, xmm2, xmmword ptr [rcx+rax*8]
| 2 | | | 0.5 | 0.5 | 1.0 | | | | vmovups xmmword ptr [rcx+rax*8], xmm1
| 1* | | | | | | | | | add rax, 0x2
| 0*F | | | | | | | | | js 0xffffffffffffffe3
Total Num Of Uops: 6
IACA claims that js
macro-fuses with add
that disagrees with Agner and the performance counters from the testp
utility. See above, v2 is proportional to 7*n/vec_size
and v3 proportional to 6*n/vec_size
which I infer to mean that js
does not macro-fuse.
I think the authors of the link in additional to number of instructions should have also considered core cycles and maybe uops.
You can save one instruction if you use the following BMI2 intrinsic:
__mmask8 mask = _bzhi_u32(-1, r);
instead of __mmask8 mask = (1 << r) -1;
. See Godbolt link.
The bzhi
instruction zeros the high bits starting at a specified position. With register operands, bzhi
has a latency of 1 cycle and a throughput of 2 per cycle.