I'm a newbie for coding using AVX512 instructions. My machine is Intel KNL 7250. I am trying to use AVX512 instructions to find the INDEX of the element having maximum absolute value, which double precision and size of array % 8 = 0. But it prints an output index = 0 every time. I don't know where is a problem, please help me. Also, how to use printf for __m512i type?
Thanks.
Code:
void main()
{
int i;
int N=160;
double vec[N];
for(i=0;i<N;i++)
{
vec[i]=(double)(-i) ;
if(i==10)
{
vec[i] = -1127;
}
}
int max = avxmax_(N, vec);
printf("maxindex=%d\n", max);
}
int avxmax_(int N, double *X )
{
// return the index of element having maximum absolute value.
int maxindex, ix, i, M;
register __m512i increment, indices, maxindices, maxvalues, absmax, max_values_tmp, abs_max_tmp, tmp;
register __mmask8 gt;
double values_X[8];
double indices_X[8];
double maxvalue;
maxindex = 1;
if( N == 1) return(maxindex);
M = N % 8;
if( M == 0)
{
increment = _mm512_set1_epi64(8); // [8,8,8,8,8,8,8,8]
indices = _mm512_setr_epi64(0, 1, 2, 3, 4, 5, 6, 7);
maxindices = indices;
maxvalues = _mm512_loadu_si512(&X[0]);
absmax = _mm512_abs_epi64(maxvalues);
for( i = 8; i < N; i += 8)
{
// advance scalar indices: indices[0] + 8, indices[1] + 8,...,indices[7] + 8
indices = _mm512_add_epi64(indices, increment);
// compare
max_values_tmp = _mm512_loadu_si512(&X[i]);
abs_max_tmp = _mm512_abs_epi64(max_values_tmp);
gt = _mm512_cmpgt_epi64_mask(abs_max_tmp, absmax);
// update
maxindices = _mm512_mask_blend_epi64(gt, maxindices, indices);
absmax = _mm512_max_epi64(absmax, abs_max_tmp);
}
// scalar part
_mm512_storeu_si512((__m512i*)values_X, absmax);
_mm512_storeu_si512((__m512i*)indices_X, maxindices);
maxindex = indices_X[0];
maxvalue = values_X[0];
for(i = 1; i < 8; i++)
{
if(values_X[i] > maxvalue)
{
maxvalue = values_X[i];
maxindex = indices_X[i];
}
}
return(maxindex);
}
}
Your function returns 0
because you're treating the int64 index as the bit-pattern for a double
, and converting that (tiny) number to an integer. double indices_X[8];
is the bug; should be uint64_t
. There are other bugs, see below.
This bug is easier to spot if you declare variables as you use them, C99 style, not obsolete C89 style.
You _mm512_storeu_si512
the vector of int64_t
indices into double indices_X[8]
, type-punning it to double, then in plain C do int maxindex = indices_X[0];
. This is implicit type-conversion, converting that subnormal double
to an integer.
(I noticed a mysterious vcvttsd2si
FP->int conversion in the asm https://godbolt.org/z/zsfc36 while converting the code to C99 style variable declarations next to initializers. That was a clue: there should be no FP->int conversion in this function. I noticed that around the same time I was moving the double indices_X[8];
declaration down into the block that uses it, and noticing it had type double
.)
But only if you use the right ones! IEEE754 exponent biases mean that the encoding / bit-pattern can be compared as a sign/magnitude integer. So you can do abs / min / max and compare on it, but not of course integer add / sub (unless you're implementing nextafter
).
_mm512_abs_epi64
is 2's complement abs, not sign-magnitude. Instead, you must just mask off the sign bit. Then you're all set to treat the result as an unsigned integer or signed-2's-complement. (Either works because the high bit is clear.)
Using integer max has the interesting property that NaNs will compare higher than any other value, Inf below that, then finite values. So we get a NaN-propagating max-finder basically for free.
On KNL (Knight's Landing), FP vmaxpd
and vcmppd
have the same performance as their integer equivalents: 2 cycle latency, 0.5c throughput. (https://agner.org/optimize/). So your way has zero advantage on KNL, but it's a neat trick for mainstream Intel, like Skylake-X and IceLake.
Use size_t
for return type and loop counters / indices to handle potentially huge arrays, instead of a random mix of int
and 64-bit vector elements. (uint64_t
for the temp array that collects the horizontal-max: it's always 64-bit even in a build with 32-bit pointers / size_t.)
bugfix: return 0
on N==1, not 1: the index of the only element is 0.
bugfix: return -1
on N%8 != 0
, instead of falling off the end of the non-void function. (Undefined behaviour if the caller uses the result in C, or as soon as execution falls off the end in C++).
bugfix: abs of an FP value = clear the sign bit, not 2's complement abs on the bit-pattern
sort of bugfix: use unsigned integer compare and max, so it would work for 2's complement integers with _mm512_abs_epi64
(which produces an unsigned result; remember that -LONG_MIN
overflows to LONG_MIN
if you keep treating it as signed).
style improvement: if (N%8 != 0) return -1;
instead of putting most of the body in an if block.
style improvement: declare vars when they're first used, and removed some unused ones that were pure noise. This is idiomatic for C since C99, which was standardized over 20 years ago.
style improvement: use simpler names for tmp vector vars that just hold a load result. Sometimes you just need a tmp var because intrinsic names are so long that you don't want to type _mm...load...
as an arg for another intrinsics. A name like v
scoped to the inner loop is a clear sign it's just a placeholder, not used later. (This style works best when you're declaring it as you init it, so it's easy to see it can't be used in an outer scope.)
optimization: reduce 8 -> 4 elements after the loop with SIMD: extract the high half, combine with existing low half. (Same as you would for a simpler horizontal reduction like sum or max). Inconvenient when we need instructions that only AVX512 has, but KNL doesn't have AVX512VL, so we have to use the 512-bit version and ignore the high garbage. But KNL does have AVX1 / AVX2 so we can still store 256-bit vectors and do some things.
Using a merge-masking _mm512_mask_extracti64x4_epi64
extract to blend the high half directly the low half of the same vector is a cool trick which compilers don't find if you use a 512-bit mask-blend. :P
sort of bugfix: in C, main
has a return type of int
in hosted implementations (running under an OS).
#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
// bugfix: indices can be larger than an int
size_t avxmax_(size_t N, double *X )
{
// return the index of element having maximum absolute value.
if( N == 1)
return 0; // bugfix: 0 is the only valid element in this case, not 1
if( N % 8 != 0) // [[unlikely]] // C++20
return -1; // bugfix: don't fall off the end of the function in this case
const __m512i fp_absmask = _mm512_set1_epi64(0x7FFFFFFFFFFFFFFF);
__m512i indices = _mm512_setr_epi64(0, 1, 2, 3, 4, 5, 6, 7);
__m512i maxindices = indices;
__m512i v = _mm512_loadu_si512(&X[0]);
__m512i absmax = _mm512_and_si512(v, fp_absmask);
for(size_t i = 8; i < N; i += 8) // [[likely]] // C++20
{
// advance indices by 8 each.
indices = _mm512_add_epi64(indices, _mm512_set1_epi64(8));
// compare
v = _mm512_loadu_si512(&X[i]);
__m512i vabs = _mm512_and_si512(v, fp_absmask);
// vabs = _mm512_abs_epi64(max_values_tmp); // for actual integers, not FP bit patterns
__mmask8 gt = _mm512_cmpgt_epu64_mask(vabs, absmax);
// update
maxindices = _mm512_mask_blend_epi64(gt, maxindices, indices);
absmax = _mm512_max_epu64(absmax, vabs);
}
// reduce 512->256; KNL doesn't have AVX512VL so some ops require 512-bit vectors
__m256i absmax_hi = _mm512_extracti64x4_epi64(absmax, 1);
__m512i absmax_hi512 = _mm512_castsi256_si512(absmax_hi); // free
__mmask8 gt = _mm512_cmpgt_epu64_mask(absmax_hi512, absmax);
__m256i abs256 = _mm512_castsi512_si256(_mm512_max_epu64(absmax_hi512, absmax)); // reduced to low 4 elements
// extract with merge-masking = blend
__m256i maxindices256 = _mm512_mask_extracti64x4_epi64(
_mm512_castsi512_si256(maxindices), gt, maxindices, 1);
// scalar part
double values_X[4];
uint64_t indices_X[4];
_mm256_storeu_si256((__m256i*)values_X, abs256);
_mm256_storeu_si256((__m256i*)indices_X, maxindices256);
size_t maxindex = indices_X[0];
double maxvalue = values_X[0];
for(int i = 1; i < 4; i++)
{
if(values_X[i] > maxvalue)
{
maxvalue = values_X[i];
maxindex = indices_X[i];
}
}
return maxindex;
}
On Godbolt: the main loop from GCC10.2 -O3 -march=knl is 8 instructions. So even if (best case) KNL could decode and run it at 2/clock, it's still taking 4 cycles per vector. You can run the program on Godbolt; it runs on Skylake-X servers so it can run AVX512 code. You can see it prints 10
.
.L4:
vpandd zmm2, zmm5, ZMMWORD PTR [rsi+rax*8] # load, folded into AND
add rax, 8
vpcmpuq k1, zmm2, zmm0, 6
vpaddq zmm1, zmm1, zmm4 # increment current indices
cmp rdi, rax
vmovdqa64 zmm3{k1}, zmm1 # blend maxidx using merge-masking
vpmaxuq zmm0, zmm0, zmm2
ja .L4
vmovapd zmm1, zmm3 # silly missed optimization related to the case where the loop runs 0 times.
.L3:
vextracti64x4 ymm2, zmm0, 0x1 # high half of absmax
vpcmpuq k1, zmm2, zmm0, 6 # compare high and low
vpmaxuq zmm0, zmm0, zmm2
# vunpckhpd xmm2, xmm0, xmm0 # setting up for unrolled scalar loop
vextracti64x4 ymm1{k1}, zmm3, 0x1 # masked extract of indices
Another option for the loop would be a masked vpbroadcastq zmm3{k1}, rax
, adding the [0..7]
per-element offsets after the loop. That would actually save the vpaddq
in the loop, and we have the right i
in a register if GCC is going to use an indexed addressing-mode anyway. (That's not good on Skylake-X; defeats micro-fusion of the memory-source vpandd
.)
Agner Fog doesn't list performance for GP->vector broadcasts, but hopefully it's only single-uop on KNL at least. (And https://uops.info/ doesn't have KNL or KNM results).
If you expect finding a new max to be very rare (e.g. array is large and uniformly distributed, or at least not trending upwards), it could be even faster to broadcast the current max and branch on finding any greater vector element.
Finding a new max means branching out of the loop (which probably mispredicts, so that's slow) and broadcasting that element (probably with a tzcnt
to find the element index, then a broadcast-load, and update the index).
Especially with KNL's 4-way SMT to hide branch miss costs, this could be an overall throughput win for large arrays; fewer instructions per element on average.
But probably significantly worse for inputs that do trend upwards, so a new max is found O(n) times on average, not sqrt(n) or log(n) or whatever frequency a uniform distribution would give us.
PS: to print vectors, store to an array and reload the elements. print a __m128i variable
Or use a debugger to show you their elements.