operand type mismatch for `vpbroadcastd'

I tried to find a KNC broadcast instruction for Xeon Phi platform. But I could not find any instruction. Instead I tried to use this AVX _mm512_set1_epi32 intrinsic in assembly. I have two questions: first is there any KNC broadcast instruction? Second, when I compiled the below code, I got the operand type mismatch for `vpbroadcastd' error.

int op = 2;
__asm__("vmovdqa32 %0,%%zmm0\n\t"
            "mov %1, %%eax\n\t"
            "vpbroadcastd %%eax, %%zmm1\n\t"
            "vpsravd %%zmm1,%%zmm0,%%zmm1\n\t"
            "vmovdqa32 %%zmm1,%0;"
            : "=m" (tt[0]): "m" (op));

which tt defined using below code and I used k1om-mpss-linux-gcc compiler for compiling this code

int * tt = (int *) aligned_malloc(16 * sizeof(int),64);

Solution

I looked at how AVX2 would do this with intrinsics and noticed that the broadcast reads from memory just like with KNC. Looking at the assembly from the AVX2 intrinsics I wrote inline assembly which does the same thing.

#include <stdio.h>
#include <x86intrin.h>
void foo(int *A, int n) {
    __m256i a16 = _mm256_loadu_si256((__m256i*)A);
    __m256i t = _mm256_set1_epi32(n);
    __m256i s16 = _mm256_srav_epi32(a16,t);
    _mm256_storeu_si256((__m256i*)A, s16);
}

void foo2(int *A, int n) {
    __asm__("vmovdqu      (%0),%%ymm0\n"
            "vpbroadcastd (%1), %%ymm1\n"
            "vpsravd      %%ymm1, %%ymm0, %%ymm0\n"
            "vmovdqu      %%ymm0, (%0)"
            :
            : "r" (A), "r" (&n)
            : "memory"
        );
}

int main(void) {
    int x[8];
    for(int i=0; i<8; i++) x[i] = 1<<i;
    for(int i=0; i<8; i++) printf("%8d ", x[i]); puts("");
    foo2(x,2);
    for(int i=0; i<8; i++) printf("%8d ", x[i]); puts("");
}

Here is my guess for KNC (using aligned loads):

void foo2_KNC(int *A, int n) {
    __asm__("vmovdqa32      (%0),%%zmm0\n"
            "vpbroadcastd   (%1), %%zmm1\n"
            "vpsravd        %%zmm1, %%zmm0, %%zmm0\n"
            "vmovdqa32      %%zmm0, (%0)"
            :
            : "r" (A), "r" (&n)
            : "memory"
        );
}

There appears to be a more efficient way of doing this with KNC and AVX512.

Intel says in regards to AVX12 in section "2.5.3 Broadcast":

EVEX encoding provides a bit-field to encode data broadcast for some load-op instructions

and then gives the example

vmulps zmm1, zmm2, [rax] {1to16}

where

The {1to16} primitive loads one float32 (single precision) elem ent from memory, replicates it 16 times to form a vector of 16 32-bit floating-point elements, multiplies the 16 float32 elements with the corresponding elements in the first source operand vector, and put each of the 16 results into the destination operand.

I have never used his syntax before but you could try

void foo2_KNC(int *A, int n) {
__asm__("vmovdqa32      (%0),%%zmm0\n\t"
        "vpsravd        (%1)%{1to16}, %%zmm0, %%zmm0\n\t"
        "vmovdqa32      %%zmm0, (%0)\t"
        :
        : "r" (A), "r" (&n)
        : "memory", "%zmm0"
    );

}

this produces

vmovdqa32      (%rax),%zmm0
vpsravd        (%rdx){1to16}, %zmm0, %zmm0
vmovdqa32      %zmm0, (%rax)

Agner Fog incidentally has a section titled "8.4 Assembly syntax for AVX-512 and Knights Corner instructions" in the documentation for objconv where he says

these two instruction sets are very similar, but have different optional instruction attributes. Instructions from these two instruction sets differ by a single bit in the prefix, even for otherwise identical instructions.

According to his documentation NASM supports the AVX-512 and KNC syntax so you could try this syntax in NASM.