c, simd, inline-assembly, intrinsics

How to write intrinsic code for this inline assembly code?


I am not good at SIMD, so I need help converting this code to intrinsics. It looks to me like C = A * B, but I am not sure. Can anybody help? I would also like to know whether the intrinsic functions are available on mobile processors: the code below is for an Intel CPU, but my work ultimately targets mobile devices. Thanks in advance.

for (int i = 0; i < M; i++, C += N) {
    float x = A[i];
    _asm {
        mov             esi, N8;
        sub             esi, 8;
        shl             esi, 2;
        xor             edi, edi;
        mov             ebx, B;
        mov             edx, C;
        vbroadcastss    ymm7, x;
    Lrep1:
        cmp             edi, esi;
        jg              Lexit1;
        vmovups         ymm0, ymmword ptr[ebx + edi];
        vmulps          ymm0, ymm0, ymm7;
        vmovups         ymmword ptr[edx + edi], ymm0;
        add             edi, 32;
        jmp             Lrep1;

    Lexit1:
    }
    for (int j = N8; j < N; j++) C[j] = x * B[j];
}

Solution

  • You'd be far better off replacing the entire code with just:

    float x = A[i];
    for (int j = 0; j < N; j++) C[j] = x * B[j];
    

    The compiler will do a far better job of optimising that than the somewhat naive attempt at asm optimisation presented above. Fire your co-worker :)

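    As for what it's doing, not a whole lot. For each i it broadcasts A[i] into a register and multiplies B by it in batches of 8 floats, writing the result to row i of C; the scalar loop after the asm handles the leftover elements from N8 to N. As I said though, it's pretty stupid, and you'd be better off performance-wise using the standard C code above. If you want intrinsics anyway, the asm maps to: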

    float x = A[i];
    __m256 _x = _mm256_set1_ps(x);
    for (int j = 0; j < N8; j += 8) 
    {
      _mm256_storeu_ps(C + j, _mm256_mul_ps(_x, _mm256_loadu_ps(B + j)));
    }
    for (int j = N8; j < N; j++) C[j] = x * B[j];
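
On the mobile part of the question: the `_mm256_*` intrinsics are AVX, so they only exist on x86. ARM-based mobile CPUs use NEON instead, with its own intrinsics in `<arm_neon.h>` and 4 floats per register rather than 8. Below is a minimal sketch of how the loop could be wrapped so that it builds on either target and falls back to plain C elsewhere. The function names `scale_row` and `scale_rows` are made up for illustration and are not from the original code, and the sketch assumes C is a row-major M x N array, as the `C += N` in the question suggests.

```c
#include <assert.h>

#if defined(__AVX__)
#include <immintrin.h>   /* x86 AVX: 8 floats per 256-bit register */
#elif defined(__ARM_NEON)
#include <arm_neon.h>    /* ARM NEON: 4 floats per 128-bit register */
#endif

/* Hypothetical helper: C[j] = x * B[j] for j in [0, N). */
void scale_row(float x, const float *B, float *C, int N)
{
    assert(N >= 0);
    int j = 0;
#if defined(__AVX__)
    /* Same operations as the asm: broadcast x, then load/multiply/store. */
    const __m256 vx = _mm256_set1_ps(x);
    for (; j + 8 <= N; j += 8)
        _mm256_storeu_ps(C + j, _mm256_mul_ps(vx, _mm256_loadu_ps(B + j)));
#elif defined(__ARM_NEON)
    /* NEON counterpart: multiply a 4-float vector by the scalar x. */
    for (; j + 4 <= N; j += 4)
        vst1q_f32(C + j, vmulq_n_f32(vld1q_f32(B + j), x));
#endif
    /* Scalar tail; this is also the whole loop when no SIMD path is built. */
    for (; j < N; j++)
        C[j] = x * B[j];
}

/* Hypothetical wrapper for the outer loop: row i of C becomes A[i] * B. */
void scale_rows(const float *A, const float *B, float *C, int M, int N)
{
    for (int i = 0; i < M; i++, C += N)
        scale_row(A[i], B, C, N);
}
```

The loop condition `j + 8 <= N` replaces the precomputed N8, so the tail loop picks up wherever the vector loop stopped. It is still worth benchmarking this against the plain C loop with optimisation turned on, since the compiler may well vectorise that on its own.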