I am not good at SIMD, so I need help converting this code to intrinsics. It looks to me like C = A * B, but I am not sure. Can anybody help? I would also like to know whether intrinsic functions are available for mobile processors: the code below is for Intel CPUs, but my work ultimately targets mobile devices. Thanks in advance.
for (int i = 0; i < M; i++, C += N) {
    float x = A[i];
    _asm {
        mov esi, N8;
        sub esi, 8;
        shl esi, 2;
        xor edi, edi;
        mov ebx, B;
        mov edx, C;
        vbroadcastss ymm7, x;
    Lrep1:
        cmp edi, esi;
        jg Lexit1;
        vmovups ymm0, ymmword ptr[ebx + edi];
        vmulps ymm0, ymm0, ymm7;
        vmovups ymmword ptr[edx + edi], ymm0;
        add edi, 32;
        jmp Lrep1;
    Lexit1:
    }
    for (int j = N8; j < N; j++) C[j] = x * B[j];
}
You'd be far better off replacing the entire code with just:
float x = A[i];
for (int j = 0; j < N; j++) C[j] = x * B[j];
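The compiler will most likely do a better job of optimising that than the somewhat naive attempt at asm optimisation presented above: with optimisation enabled, mainstream compilers can auto-vectorise a simple loop like this on their own. Fire your co-worker :)
As for what it's doing, not a whole lot. For each i it broadcasts A[i] into all eight lanes of a YMM register, then walks B eight floats at a time, multiplying each batch by that value and storing the result into the current row of C; the scalar loop afterwards handles the last N - N8 elements. So it is not a matrix product: row i of C is just A[i] times B, i.e. the outer product of the vectors A and B. As I said though, the asm is pretty naive, and from a performance POV you'd be better off using the standard C code above.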
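To make the "let the compiler do it" suggestion concrete, here is a minimal sketch of the whole operation as one plain C function. The name `outer_product_scalar` and the use of `size_t` are my own choices, and it assumes C is a row-major M x N buffer that does not overlap A or B (that is what `restrict` promises the compiler).

```c
#include <stddef.h>

/* Hypothetical helper, not from the original post.
 * Computes C[i*N + j] = A[i] * B[j] for a row-major M x N matrix C.
 * Assumption: C does not overlap A or B, which is what restrict states. */
void outer_product_scalar(float *restrict C,
                          const float *restrict A,
                          const float *restrict B,
                          size_t M, size_t N)
{
    for (size_t i = 0; i < M; i++) {
        const float x = A[i];      /* the value the asm broadcasts */
        float *row = C + i * N;    /* same effect as "C += N" per row */
        for (size_t j = 0; j < N; j++)
            row[j] = x * B[j];     /* simple loop, easy to auto-vectorise */
    }
}
```

Without `restrict` the compiler has to allow for C aliasing B, and typically either adds a runtime overlap check or gives up on vectorising. Whether you actually get vector code still depends on the compiler and flags, so it is worth checking the generated assembly.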
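// Intrinsic version of the loop body (goes inside the "for i" loop).
// Needs #include <immintrin.h> and AVX enabled (e.g. -mavx or /arch:AVX).
// N8 is N rounded down to a multiple of 8, as in the original code.
float x = A[i];
__m256 _x = _mm256_set1_ps(x);   // broadcast x into all 8 lanes
for (int j = 0; j < N8; j += 8)
{
    // unaligned load of 8 floats from B, multiply by x, unaligned store to C
    _mm256_storeu_ps(C + j, _mm256_mul_ps(_x, _mm256_loadu_ps(B + j)));
}
// scalar tail for the remaining N - N8 elements
for (int j = N8; j < N; j++) C[j] = x * B[j];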
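On the mobile question: the `_mm256_*` intrinsics are AVX, so x86 only. ARM cores, which is what most phones use, expose their SIMD unit (NEON) through a different header, `arm_neon.h`, with 128-bit registers holding 4 floats. Below is a rough sketch of how one source file could cover both, assuming GCC or Clang style feature macros (`__AVX__`, `__ARM_NEON`). The function names `scale_row_simd` and `outer_product_simd` are made up for illustration, and there is a plain scalar fallback when neither macro is defined.

```c
#include <stddef.h>

#if defined(__AVX__)
#include <immintrin.h>   /* x86 AVX intrinsics */
#elif defined(__ARM_NEON)
#include <arm_neon.h>    /* ARM NEON intrinsics */
#endif

/* Hypothetical helper: out[j] = x * in[j] for j in [0, n). */
void scale_row_simd(float *out, const float *in, float x, size_t n)
{
    size_t j = 0;
#if defined(__AVX__)
    const __m256 vx = _mm256_set1_ps(x);          /* 8 copies of x */
    for (; j + 8 <= n; j += 8)
        _mm256_storeu_ps(out + j,
                         _mm256_mul_ps(vx, _mm256_loadu_ps(in + j)));
#elif defined(__ARM_NEON)
    for (; j + 4 <= n; j += 4)                    /* 4 floats per register */
        vst1q_f32(out + j, vmulq_n_f32(vld1q_f32(in + j), x));
#endif
    for (; j < n; j++)                            /* scalar tail or fallback */
        out[j] = x * in[j];
}

/* Hypothetical wrapper: row i of the row-major M x N matrix C is A[i] * B. */
void outer_product_simd(float *C, const float *A, const float *B,
                        size_t M, size_t N)
{
    for (size_t i = 0; i < M; i++)
        scale_row_simd(C + i * N, B, A[i], N);
}
```

The `j + 8 <= n` loop condition replaces the separate `N8` variable, so the tail loop simply continues from wherever the vector loop stopped. For a loop this simple, the plain C version is still the first thing I would try on a mobile target before hand-writing NEON.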