I am looking for the inline assembly equivalent of an add-reduce operation for Xeon Phi. I found the _mm512_reduce_add_epi32 intrinsic on the Intel intrinsics website (link). However, the website does not mention the actual assembly instructions behind it.
Can anybody help me find the inline assembly of the reduction operation on the Xeon Phi platform?
Thanks
Doing a reduction of 16 integers with KNC is an interesting case that shows how it differs from AVX512.
The _mm512_reduce_add_epi32 intrinsic is only supported by the Intel compiler (currently). It's one of those annoying many-instruction intrinsics like in SVML. But I think I understand why Intel implemented this intrinsic in this case: the results for KNC and AVX512 are very different.
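For reference, using the intrinsic itself is a one-liner if you are on the Intel compiler (a minimal sketch; the function name is mine):

#include <immintrin.h>

/* ICC convenience intrinsic: it expands to a sequence of
   permutes/shuffles and adds, not to a single instruction. */
int reduce_with_icc(__m512i v) {
    return _mm512_reduce_add_epi32(v);
}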
With AVX512 I would do something like this:
__m256i hi8 = _mm512_extracti64x4_epi64(a,1);
__m256i lo8 = _mm512_castsi512_si256(a);
__m256i vsum1 = _mm256_add_epi32(hi8,lo8);
and then I would do a reduction just like in AVX2:
__m256i vsum2 = _mm256_hadd_epi32(vsum1,vsum1);
__m256i vsum3 = _mm256_hadd_epi32(vsum2,vsum2);
__m128i hi4 = _mm256_extracti128_si256(vsum3,1);
__m128i lo4 = _mm256_castsi256_si128(vsum3);
__m128i vsum4 = _mm_add_epi32(hi4, lo4);
int sum = _mm_cvtsi128_si32(vsum4);
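Wrapped up with a quick check (an untested sketch; hsum16 is my name for it; compile with e.g. gcc -O3 -mavx512f and run on an AVX512F machine):

#include <immintrin.h>
#include <stdio.h>

static int hsum16(__m512i a) {
    __m256i hi8   = _mm512_extracti64x4_epi64(a, 1); // upper 8 ints
    __m256i lo8   = _mm512_castsi512_si256(a);       // lower 8 ints
    __m256i vsum1 = _mm256_add_epi32(hi8, lo8);      // 16 -> 8
    __m256i vsum2 = _mm256_hadd_epi32(vsum1, vsum1); // pairwise sums within each 128-bit lane
    __m256i vsum3 = _mm256_hadd_epi32(vsum2, vsum2); // each 128-bit lane now holds its own total
    __m128i hi4   = _mm256_extracti128_si256(vsum3, 1);
    __m128i lo4   = _mm256_castsi256_si128(vsum3);
    return _mm_cvtsi128_si32(_mm_add_epi32(hi4, lo4));
}

int main(void) {
    __m512i v = _mm512_setr_epi32(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
    printf("%d\n", hsum16(v)); // expect 120
    return 0;
}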
It would be interesting to see how Intel implements _mm512_reduce_add_epi32 with AVX512.
But the KNC instruction set does not support AVX or SSE, so with KNC everything has to be done with the full 512-bit vectors. Intel created instructions unique to KNC to do this.
Looking at the assembly from Giles' answer we can see what it does. First it permutes the upper 256-bits to the lower 256-bits using an instruction unique to KNC like this:
vpermf32x4 $238, %zmm0, %zmm1
The value 238 is 3232 in base 4, so zmm1 in terms of the four 128-bit lanes is (3,2,3,2).
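If the base-4 reading of these immediates is not obvious, here is a tiny standalone helper (hypothetical, purely to illustrate the encoding) that splits an imm8 into its four 2-bit lane selectors:

#include <stdio.h>

// Print the four 2-bit lane selectors of an imm8, high lane first.
static void decode_imm8(unsigned imm) {
    printf("(%u,%u,%u,%u)\n",
           (imm >> 6) & 3, (imm >> 4) & 3, (imm >> 2) & 3, imm & 3);
}

int main(void) {
    decode_imm8(238); // prints (3,2,3,2)
    decode_imm8(85);  // prints (1,1,1,1)
    return 0;
}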
Next it does a vector sum:
vpaddd %zmm0, %zmm1, %zmm3
which gives the four 128-bit lanes (3+3, 2+2, 3+1, 2+0). Then it permutes the second 128-bit lane into every lane, giving (3+1, 3+1, 3+1, 3+1), like this:
vpermf32x4 $85, %zmm3, %zmm2
where 85 is 1111 in base 4. Then it adds these together:
vpaddd %zmm3, %zmm2, %zmm4
so that the lower 128-bit lane of zmm4 contains the sum of the four 128-bit lanes (3+2+1+0).
At this point it needs to permute the 32-bit values within each 128-bit lane. Again it uses a unique feature of KNC which allows it to permute and add at the same time (or at least the notation is unique).
vpaddd %zmm4{badc}, %zmm4, %zmm5
produces (a+b, a+b, c+d, c+d) and
vpaddd %zmm5{cdab}, %zmm5, %zmm6
produces (a+b+c+d, a+b+c+d, a+b+c+d, a+b+c+d). Now it is just a matter of extracting the lower 32 bits.
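To sanity-check the whole data flow without KNC hardware, here is a scalar model of the four steps (a sketch of mine, treating a zmm register as int v[16] with element 0 lowest; the swizzle functions follow Intel's element naming, where a is the lowest element of each lane):

#include <stdio.h>

// Permute 128-bit lanes; selectors are the base-4 digits of imm.
static void perm_lanes(const int *s, int *d, unsigned imm) {
    for (int lane = 0; lane < 4; lane++) {
        unsigned sel = (imm >> (2 * lane)) & 3;
        for (int e = 0; e < 4; e++) d[4*lane + e] = s[4*sel + e];
    }
}

// {badc}: swap the 64-bit halves of each lane.
static void swiz_badc(const int *s, int *d) {
    for (int l = 0; l < 16; l += 4) {
        d[l] = s[l+2]; d[l+1] = s[l+3]; d[l+2] = s[l]; d[l+3] = s[l+1];
    }
}

// {cdab}: swap adjacent elements within each lane.
static void swiz_cdab(const int *s, int *d) {
    for (int l = 0; l < 16; l += 4) {
        d[l] = s[l+1]; d[l+1] = s[l]; d[l+2] = s[l+3]; d[l+3] = s[l+2];
    }
}

int main(void) {
    int z0[16], t[16], z3[16], z4[16], z5[16], z6[16];
    for (int i = 0; i < 16; i++) z0[i] = i;

    perm_lanes(z0, t, 238);                            // vpermf32x4 $238
    for (int i = 0; i < 16; i++) z3[i] = z0[i] + t[i]; // vpaddd
    perm_lanes(z3, t, 85);                             // vpermf32x4 $85
    for (int i = 0; i < 16; i++) z4[i] = z3[i] + t[i]; // vpaddd
    swiz_badc(z4, t);
    for (int i = 0; i < 16; i++) z5[i] = z4[i] + t[i]; // vpaddd {badc}
    swiz_cdab(z5, t);
    for (int i = 0; i < 16; i++) z6[i] = z5[i] + t[i]; // vpaddd {cdab}

    printf("%d\n", z6[0]); // prints 120 = 0+1+...+15
    return 0;
}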
Here is an alternative solution for AVX512 which is similar to the solution for KNC:
#include <x86intrin.h>
int foo(__m512i a) {
    // each step adds a permutation of the previous partial sum to itself
    __m512i vsum1 = _mm512_add_epi32(a, _mm512_shuffle_i64x2(a, a, 0xee));               // fold upper 256 bits in
    __m512i vsum2 = _mm512_add_epi32(vsum1, _mm512_shuffle_i64x2(vsum1, vsum1, 0x55));   // fold second 128-bit lane in
    __m512i vsum3 = _mm512_add_epi32(vsum2, _mm512_shuffle_epi32(vsum2, _MM_PERM_BADC)); // fold 64-bit halves in
    __m512i vsum4 = _mm512_add_epi32(vsum3, _mm512_shuffle_epi32(vsum3, _MM_PERM_CDAB)); // fold adjacent elements in
    return _mm_cvtsi128_si32(_mm512_castsi512_si128(vsum4));
}
With gcc -O3 -mavx512f this gives:
vshufi64x2 $238, %zmm0, %zmm0, %zmm1
vpaddd %zmm1, %zmm0, %zmm0
vshufi64x2 $85, %zmm0, %zmm0, %zmm1
vpaddd %zmm1, %zmm0, %zmm0
vpshufd $78, %zmm0, %zmm1
vpaddd %zmm1, %zmm0, %zmm0
vpshufd $177, %zmm0, %zmm1
vpaddd %zmm1, %zmm0, %zmm0
vmovd %xmm0, %eax
ret
AVX512 uses vshufi64x2 instead of vpermf32x4, and KNC combines the permuting within lanes with the add using the {abcd} notation (e.g. vpaddd %zmm4{badc}, %zmm4, %zmm5). This is basically what is achieved using _mm256_hadd_epi32.
I forgot I had already seen this question for AVX512. Here is another solution.
For what it's worth, here are intrinsics (untested) for KNC:
#include <immintrin.h>
#include <stdint.h>
int foo(__m512i a) {
    __m512i vsum1 = _mm512_add_epi32(a, _mm512_permute4f128_epi32(a, 0xee));
    __m512i vsum2 = _mm512_add_epi32(vsum1, _mm512_permute4f128_epi32(vsum1, 0x55));
    __m512i vsum3 = _mm512_add_epi32(vsum2, _mm512_swizzle_epi32(vsum2, _MM_SWIZ_REG_BADC));
    __m512i vsum4 = _mm512_add_epi32(vsum3, _mm512_swizzle_epi32(vsum3, _MM_SWIZ_REG_CDAB));
    // extract the low element through memory; packstorelo may write up
    // to 64 bytes, so give it a full aligned cache line
    int32_t out[16] __attribute__((aligned(64)));
    _mm512_packstorelo_epi32(out, vsum4);
    return out[0];
}
I don't see a difference in functionality between KNC's _mm512_permute4f128_epi32(a, imm8) and AVX512's _mm512_shuffle_i32x4(a, a, imm8).
The main difference in this case is that _mm512_shuffle_epi32 generates a separate vpshufd instruction whereas _mm512_swizzle_epi32 does not: the swizzle is folded into the vpaddd as a source modifier. That appears to be an advantage of KNC over AVX512.