I am looking for the inline assembly equivalent of an add-reduce operation for Xeon Phi. I found the _mm512_reduce_add_epi32 intrinsic on the Intel intrinsics website (link). However, the website does not mention the actual assembly instructions behind it.
Can anybody help me find the inline assembly of the reduction operation on the Xeon Phi platform?
Thanks
Doing a reduction of 16 integers with KNC is an interesting case that shows how it differs from AVX512.
The _mm512_reduce_add_epi32 intrinsic is only supported by the Intel compiler (currently). It's one of those annoying many-instruction intrinsics like in SVML. But I think I understand why Intel implemented this intrinsic in this case: the results for KNC and AVX512 are very different.
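For reference, using the intrinsic itself is a one-liner if you are on the Intel compiler (a minimal sketch; the function name is mine):

#include <immintrin.h>

/* ICC convenience intrinsic: it expands to a sequence of
   permutes/shuffles and adds, not to a single instruction. */
int reduce_with_icc(__m512i v) {
    return _mm512_reduce_add_epi32(v);
}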
With AVX512 I would do something like this:
__m256i hi8 = _mm512_extracti64x4_epi64(a,1);
__m256i lo8 = _mm512_castsi512_si256(a);
__m256i vsum1 = _mm256_add_epi32(hi8,lo8);
and then I would do a reduction just like in AVX2:
__m256i vsum2 = _mm256_hadd_epi32(vsum1,vsum1);
__m256i vsum3 = _mm256_hadd_epi32(vsum2,vsum2);
__m128i hi4 = _mm256_extracti128_si256(vsum3,1);
__m128i lo4 = _mm256_castsi256_si128(vsum3);
__m128i vsum4 = _mm_add_epi32(hi4, lo4);
int sum = _mm_cvtsi128_si32(vsum4);
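Wrapped up with a quick check (an untested sketch; hsum16 is my name for it; compile with e.g. gcc -O3 -mavx512f and run on an AVX512F machine):

#include <immintrin.h>
#include <stdio.h>

static int hsum16(__m512i a) {
    __m256i hi8   = _mm512_extracti64x4_epi64(a, 1); // upper 8 ints
    __m256i lo8   = _mm512_castsi512_si256(a);       // lower 8 ints
    __m256i vsum1 = _mm256_add_epi32(hi8, lo8);      // 16 -> 8
    __m256i vsum2 = _mm256_hadd_epi32(vsum1, vsum1); // pairwise sums within each 128-bit lane
    __m256i vsum3 = _mm256_hadd_epi32(vsum2, vsum2); // each 128-bit lane now holds its own total
    __m128i hi4   = _mm256_extracti128_si256(vsum3, 1);
    __m128i lo4   = _mm256_castsi256_si128(vsum3);
    return _mm_cvtsi128_si32(_mm_add_epi32(hi4, lo4));
}

int main(void) {
    __m512i v = _mm512_setr_epi32(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
    printf("%d\n", hsum16(v)); // expect 120
    return 0;
}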
It would be interesting to see how Intel implements _mm512_reduce_add_epi32 with AVX512.
But the KNC instruction set does not support AVX or SSE, so with KNC everything has to be done with the full 512-bit vectors. Intel created instructions unique to KNC to do this.
Looking at the assembly from Giles' answer we can see what it does. First it permutes the upper 256-bits to the lower 256-bits using an instruction unique to KNC like this:
vpermf32x4 $238, %zmm0, %zmm1
The value 238 is 3232 in base 4, so zmm1 in terms of the four 128-bit lanes is (3,2,3,2).
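If the base-4 reading of these immediates is not obvious, here is a tiny standalone helper (hypothetical, purely to illustrate the encoding) that splits an imm8 into its four 2-bit lane selectors:

#include <stdio.h>

// Print the four 2-bit lane selectors of an imm8, high lane first.
static void decode_imm8(unsigned imm) {
    printf("(%u,%u,%u,%u)\n",
           (imm >> 6) & 3, (imm >> 4) & 3, (imm >> 2) & 3, imm & 3);
}

int main(void) {
    decode_imm8(238); // prints (3,2,3,2)
    decode_imm8(85);  // prints (1,1,1,1)
    return 0;
}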
Next it does a vector sum:
vpaddd %zmm0, %zmm1, %zmm3
which gives the four 128-bit lanes (3+3, 2+2, 3+1, 2+0). Then it permutes the second 128-bit lane into every lane, giving (3+1, 3+1, 3+1, 3+1), like this:
vpermf32x4 $85, %zmm3, %zmm2
where 85 is 1111 in base 4. Then it adds these together:
vpaddd %zmm3, %zmm2, %zmm4
so that the lower 128-bit lane of zmm4 contains the sum of the four 128-bit lanes (3+2+1+0).
At this point it needs to permute the 32-bit values within each 128-bit lane. Again it uses a unique feature of KNC which allows it to permute and add at the same time (or at least the notation is unique).
vpaddd %zmm4{badc}, %zmm4, %zmm5
produces (a+b, a+b, c+d, c+d) and
vpaddd %zmm5{cdab}, %zmm5, %zmm6
produces (a+b+c+d, a+b+c+d, a+b+c+d, a+b+c+d). Now it is just a matter of extracting the lower 32 bits.
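To sanity-check the whole data flow without KNC hardware, here is a scalar model of the four steps (a sketch of mine, treating a zmm register as int v[16] with element 0 lowest; the swizzle functions follow Intel's element naming, where a is the lowest element of each lane):

#include <stdio.h>

// Permute 128-bit lanes; selectors are the base-4 digits of imm.
static void perm_lanes(const int *s, int *d, unsigned imm) {
    for (int lane = 0; lane < 4; lane++) {
        unsigned sel = (imm >> (2 * lane)) & 3;
        for (int e = 0; e < 4; e++) d[4*lane + e] = s[4*sel + e];
    }
}

// {badc}: swap the 64-bit halves of each lane.
static void swiz_badc(const int *s, int *d) {
    for (int l = 0; l < 16; l += 4) {
        d[l] = s[l+2]; d[l+1] = s[l+3]; d[l+2] = s[l]; d[l+3] = s[l+1];
    }
}

// {cdab}: swap adjacent elements within each lane.
static void swiz_cdab(const int *s, int *d) {
    for (int l = 0; l < 16; l += 4) {
        d[l] = s[l+1]; d[l+1] = s[l]; d[l+2] = s[l+3]; d[l+3] = s[l+2];
    }
}

int main(void) {
    int z0[16], t[16], z3[16], z4[16], z5[16], z6[16];
    for (int i = 0; i < 16; i++) z0[i] = i;

    perm_lanes(z0, t, 238);                            // vpermf32x4 $238
    for (int i = 0; i < 16; i++) z3[i] = z0[i] + t[i]; // vpaddd
    perm_lanes(z3, t, 85);                             // vpermf32x4 $85
    for (int i = 0; i < 16; i++) z4[i] = z3[i] + t[i]; // vpaddd
    swiz_badc(z4, t);
    for (int i = 0; i < 16; i++) z5[i] = z4[i] + t[i]; // vpaddd {badc}
    swiz_cdab(z5, t);
    for (int i = 0; i < 16; i++) z6[i] = z5[i] + t[i]; // vpaddd {cdab}

    printf("%d\n", z6[0]); // prints 120 = 0+1+...+15
    return 0;
}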
Here is an alternative solution for AVX512 which is similar to the solution for KNC:
#include <x86intrin.h>
int foo(__m512i a) {
    // each step adds a permutation of the previous partial sum to itself
    __m512i vsum1 = _mm512_add_epi32(a, _mm512_shuffle_i64x2(a, a, 0xee));               // fold upper 256 bits in
    __m512i vsum2 = _mm512_add_epi32(vsum1, _mm512_shuffle_i64x2(vsum1, vsum1, 0x55));   // fold second 128-bit lane in
    __m512i vsum3 = _mm512_add_epi32(vsum2, _mm512_shuffle_epi32(vsum2, _MM_PERM_BADC)); // fold 64-bit halves in
    __m512i vsum4 = _mm512_add_epi32(vsum3, _mm512_shuffle_epi32(vsum3, _MM_PERM_CDAB)); // fold adjacent elements in
    return _mm_cvtsi128_si32(_mm512_castsi512_si128(vsum4));
}
With gcc -O3 -mavx512f this gives:
vshufi64x2 $238, %zmm0, %zmm0, %zmm1
vpaddd %zmm1, %zmm0, %zmm0
vshufi64x2 $85, %zmm0, %zmm0, %zmm1
vpaddd %zmm1, %zmm0, %zmm0
vpshufd $78, %zmm0, %zmm1
vpaddd %zmm1, %zmm0, %zmm0
vpshufd $177, %zmm0, %zmm1
vpaddd %zmm1, %zmm0, %zmm0
vmovd %xmm0, %eax
ret
AVX512 uses vshufi64x2 instead of vpermf32x4, and KNC combines the permuting within lanes with the add using the {abcd} notation (e.g. vpaddd %zmm4{badc}, %zmm4, %zmm5). This is basically what is achieved using _mm256_hadd_epi32.
I forgot I had already seen this question for AVX512. Here is another solution.
For what it's worth, here are intrinsics (untested) for KNC:
#include <immintrin.h>
#include <stdint.h>
int foo(__m512i a) {
    __m512i vsum1 = _mm512_add_epi32(a, _mm512_permute4f128_epi32(a, 0xee));
    __m512i vsum2 = _mm512_add_epi32(vsum1, _mm512_permute4f128_epi32(vsum1, 0x55));
    __m512i vsum3 = _mm512_add_epi32(vsum2, _mm512_swizzle_epi32(vsum2, _MM_SWIZ_REG_BADC));
    __m512i vsum4 = _mm512_add_epi32(vsum3, _mm512_swizzle_epi32(vsum3, _MM_SWIZ_REG_CDAB));
    // extract the low element through memory; packstorelo may write up
    // to 64 bytes, so give it a full aligned cache line
    int32_t out[16] __attribute__((aligned(64)));
    _mm512_packstorelo_epi32(out, vsum4);
    return out[0];
}
I don't see a difference in functionality between KNC's _mm512_permute4f128_epi32(a, imm8) and AVX512's _mm512_shuffle_i32x4(a, a, imm8).
The main difference in this case is that _mm512_shuffle_epi32 generates a separate vpshufd instruction whereas _mm512_swizzle_epi32 does not: the swizzle is folded into the vpaddd as a source modifier. That appears to be an advantage of KNC over AVX512.