operand type mismatch for `vpbroadcastd'


I tried to find a KNC broadcast instruction for the Xeon Phi platform but could not find one. Instead, I tried to reproduce the AVX-512 _mm512_set1_epi32 intrinsic in inline assembly. I have two questions. First, is there a KNC broadcast instruction? Second, when I compile the code below, I get the error: operand type mismatch for `vpbroadcastd'.

int op = 2;
__asm__("vmovdqa32 %0,%%zmm0\n\t"
            "mov %1, %%eax\n\t"
            "vpbroadcastd %%eax, %%zmm1\n\t"
            "vpsravd %%zmm1,%%zmm0,%%zmm1\n\t"
            "vmovdqa32 %%zmm1,%0;"
            : "=m" (tt[0]): "m" (op));

Here tt is defined as shown below, and I compiled the code with the k1om-mpss-linux-gcc compiler:

int * tt = (int *) aligned_malloc(16 * sizeof(int),64);
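
For readers reproducing this: aligned_malloc is not a standard C function, so it is presumably a local helper. Below is a minimal sketch of an equivalent allocation, assuming a C11 library that provides aligned_alloc (whether the k1om toolchain ships it is not confirmed here). The helper name alloc_zmm_ints is hypothetical. The 64-byte alignment matters because the aligned vmovdqa32 form expects a 64-byte-aligned address for a zmm-sized operand.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not from the original post): allocate 16 ints,
   enough for one zmm register, on a 64-byte boundary.
   Assumes sizeof(int) == 4, so the size (64 bytes) is a multiple of the
   alignment, as C11 aligned_alloc requires. */
static int *alloc_zmm_ints(void) {
    return (int *)aligned_alloc(64, 16 * sizeof(int));
}
```

The returned pointer is released with free(), and the caller should check it for NULL before use.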

Solution

  • I looked at how AVX2 would do this with intrinsics and noticed that the broadcast reads from memory, just like with KNC. Based on the assembly generated from the AVX2 intrinsics, I wrote inline assembly that does the same thing.

    #include <stdio.h>
    #include <x86intrin.h>
    void foo(int *A, int n) {
        __m256i a16 = _mm256_loadu_si256((__m256i*)A);
        __m256i t = _mm256_set1_epi32(n);
        __m256i s16 = _mm256_srav_epi32(a16,t);
        _mm256_storeu_si256((__m256i*)A, s16);
    }
    
    void foo2(int *A, int n) {
        __asm__("vmovdqu      (%0),%%ymm0\n"
                "vpbroadcastd (%1), %%ymm1\n"
                "vpsravd      %%ymm1, %%ymm0, %%ymm0\n"
                "vmovdqu      %%ymm0, (%0)"
                :
                : "r" (A), "r" (&n)
                : "memory", "%ymm0", "%ymm1"
            );
    }
    
    int main(void) {
        int x[8];
        for(int i=0; i<8; i++) x[i] = 1<<i;
        for(int i=0; i<8; i++) printf("%8d ", x[i]); puts("");
        foo2(x,2);
        for(int i=0; i<8; i++) printf("%8d ", x[i]); puts("");
    }
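
To check the output of foo2 on a machine without AVX2, a portable scalar reference can help. The sketch below is an assumption-laden model, not the author's code: the names sra_lane and sra_broadcast_ref are hypothetical, and the handling of counts above 31 follows the documented vpsravd behavior of filling the lane with the sign bit.

```c
#include <stddef.h>
#include <stdint.h>

/* Arithmetic right shift of one 32-bit lane, modeled on vpsravd:
   the count is treated as unsigned, and counts above 31 fill the
   lane with the sign bit (result 0 or -1). */
static int32_t sra_lane(int32_t value, uint32_t count) {
    if (count > 31)
        count = 31;
    if (value < 0) {
        /* Right-shifting a negative int is implementation-defined in C,
           so shift the bitwise complement (non-negative) and map back. */
        uint32_t shifted = ~(uint32_t)value >> count;
        return -(int32_t)shifted - 1;
    }
    return (int32_t)((uint32_t)value >> count);
}

/* Shift every element by the same count n, which is the effect of
   broadcasting n to all lanes before the variable shift. */
static void sra_broadcast_ref(int32_t *a, size_t len, int32_t n) {
    for (size_t i = 0; i < len; i++)
        a[i] = sra_lane(a[i], (uint32_t)n);
}
```

For the input used in main above (x[i] = 1<<i with n = 2), this reference yields 0 0 1 2 4 8 16 32, which is what the second line printed by the program should show.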
    

    Here is my guess for KNC (using aligned loads):

    void foo2_KNC(int *A, int n) {
        __asm__("vmovdqa32      (%0),%%zmm0\n"
                "vpbroadcastd   (%1), %%zmm1\n"
                "vpsravd        %%zmm1, %%zmm0, %%zmm0\n"
                "vmovdqa32      %%zmm0, (%0)"
                :
                : "r" (A), "r" (&n)
                : "memory", "%zmm0", "%zmm1"
            );
    }
    

    There appears to be a more efficient way of doing this with KNC and AVX-512.

    Regarding AVX-512, Intel says in section "2.5.3 Broadcast":

    EVEX encoding provides a bit-field to encode data broadcast for some load-op instructions

    and then gives the example

    vmulps zmm1, zmm2, [rax] {1to16}
    

    where

    The {1to16} primitive loads one float32 (single precision) element from memory, replicates it 16 times to form a vector of 16 32-bit floating-point elements, multiplies the 16 float32 elements with the corresponding elements in the first source operand vector, and put each of the 16 results into the destination operand.
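
The semantics Intel describes can be sketched as a scalar loop. This is only an illustrative model of what vmulps zmm1, zmm2, [rax] {1to16} computes, not how the hardware does it. The names LANES and mul_bcast_1to16 are hypothetical.

```c
#include <stddef.h>

enum { LANES = 16 }; /* sixteen float32 lanes in one 512-bit register */

/* Scalar model of: vmulps zmm1, zmm2, [rax] {1to16}
   dst  plays the role of zmm1 (destination),
   src1 plays the role of zmm2 (first source operand),
   mem  plays the role of [rax] (one float32 in memory). */
static void mul_bcast_1to16(float dst[LANES], const float src1[LANES],
                            const float *mem) {
    const float scalar = *mem; /* one 32-bit load, reused for every lane */
    for (size_t i = 0; i < LANES; i++)
        dst[i] = src1[i] * scalar;
}
```

The point of the embedded broadcast is that the replication happens as part of the arithmetic instruction, so no separate broadcast instruction or extra register is needed.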

    I have never used this syntax before, but you could try:

    void foo2_KNC(int *A, int n) {
        __asm__("vmovdqa32      (%0),%%zmm0\n\t"
                "vpsravd        (%1)%{1to16}, %%zmm0, %%zmm0\n\t"
                "vmovdqa32      %%zmm0, (%0)"
                :
                : "r" (A), "r" (&n)
                : "memory", "%zmm0"
            );
    }

    This produces:

    vmovdqa32      (%rax),%zmm0
    vpsravd        (%rdx){1to16}, %zmm0, %zmm0
    vmovdqa32      %zmm0, (%rax)
    

    Incidentally, Agner Fog has a section titled "8.4 Assembly syntax for AVX-512 and Knights Corner instructions" in the documentation for objconv, where he says:

    these two instruction sets are very similar, but have different optional instruction attributes. Instructions from these two instruction sets differ by a single bit in the prefix, even for otherwise identical instructions.

    According to his documentation, NASM supports both the AVX-512 and KNC syntax, so you could try this syntax in NASM.