Tags: c++, assembly, x86-64, avx512, auto-vectorization

AVX512 auto-vectorized C++ matrix-vector functions are much slower when source = destination, in-place


I've written a few functions that carry out matrix-vector multiplication using a single matrix and an array of source vectors. I wrote each function once in C++ and once in x86 AVX-512 assembly to compare their performance with the Intel VTune Profiler. When the source vector array is also used as the destination array (in-place), the assembly variants run 3.5x to 10x faster than their C++ counterparts. But with separate source and destination arrays, the assembly variants are barely faster than the C++ versions, achieving almost the same performance, and sometimes even slightly worse.

Another thing I can't understand is why the C++ counterparts can even get near (or beat) the performance of the assembly variants when using different source and destination arrays, even though the assembly code is much shorter and, according to the static-analysis tools uiCA and llvm-mca, several times faster. uica.uops.info

To keep this post from getting even longer, I'm only including the code for the mat4-vec4 multiplication functions.

This is the code for the assembly variant, which assumes the matrix is transposed:

alignas(64) uint32_t mat4_mul_vec4_avx512_vpermps_index[64]{    0, 0, 0, 0, 4, 4, 4, 4, 8, 8, 8, 8, 12, 12, 12, 12,
                                                            1, 1, 1, 1, 5, 5, 5, 5, 9, 9, 9, 9, 13, 13, 13, 13,
                                                            2, 2, 2, 2, 6, 6, 6, 6, 10, 10, 10, 10, 14, 14, 14, 14,
                                                            3, 3, 3, 3, 7, 7, 7, 7, 11, 11, 11, 11, 15, 15, 15, 15 };

void __declspec(naked, align(64)) mat4_mul_vec4_avx512(vec4f_t* destination, const mat4f_t& src1, const vec4f_t* src2, uint32_t vector_count) {
__asm {
    vbroadcastf32x4 zmm16, xmmword ptr[rdx]
    vbroadcastf32x4 zmm17, xmmword ptr[rdx + 16]

    vbroadcastf32x4 zmm18, xmmword ptr[rdx + 32]
    vbroadcastf32x4 zmm19, xmmword ptr[rdx + 48]

    vmovups zmm20, zmmword ptr[mat4_mul_vec4_avx512_vpermps_index]
    vmovups zmm21, zmmword ptr[mat4_mul_vec4_avx512_vpermps_index + 64]

    vmovups zmm22, zmmword ptr[mat4_mul_vec4_avx512_vpermps_index + 128]
    vmovups zmm23, zmmword ptr[mat4_mul_vec4_avx512_vpermps_index + 192]

    vmovups zmm24, zmmword ptr[r8]

    vpermps zmm25, zmm20, zmm24
    vpermps zmm26, zmm21, zmm24
    vpermps zmm27, zmm22, zmm24
    vpermps zmm28, zmm23, zmm24

    xor eax, eax

    align 32
    mat4_mul_vec4_avx512_loop:

        vmovups zmm24, zmmword ptr[r8+rax+64]

        vmulps zmm29, zmm16, zmm25
        vpermps zmm25, zmm20, zmm24

        vfmadd231ps zmm29, zmm17, zmm26
        vpermps zmm26, zmm21, zmm24

        vfmadd231ps zmm29, zmm18, zmm27
        vpermps zmm27, zmm22, zmm24

        vfmadd231ps zmm29, zmm19, zmm28
        vpermps zmm28, zmm23, zmm24

        vmovups zmmword ptr[rcx+rax], zmm29

    add rax, 64

    sub r9, 4
    jnz mat4_mul_vec4_avx512_loop

    ret
    }
}

This is the C++ variant, which assumes the matrix is not transposed:

void mat4_mul_vec4_cpp(vec4f_t* destination, const mat4f_t& src1, const vec4f_t* src2, uint32_t vector_count) {
for (uint32_t i0{}; i0 < vector_count; ++i0) {
    destination[i0].element_[0] = src1.element_[0][0] * src2[i0].element_[0] + src1.element_[0][1] * src2[i0].element_[1] + src1.element_[0][2] * src2[i0].element_[2] + src1.element_[0][3] * src2[i0].element_[3];
    destination[i0].element_[1] = src1.element_[1][0] * src2[i0].element_[0] + src1.element_[1][1] * src2[i0].element_[1] + src1.element_[1][2] * src2[i0].element_[2] + src1.element_[1][3] * src2[i0].element_[3];
    destination[i0].element_[2] = src1.element_[2][0] * src2[i0].element_[0] + src1.element_[2][1] * src2[i0].element_[1] + src1.element_[2][2] * src2[i0].element_[2] + src1.element_[2][3] * src2[i0].element_[3];
    destination[i0].element_[3] = src1.element_[3][0] * src2[i0].element_[0] + src1.element_[3][1] * src2[i0].element_[1] + src1.element_[3][2] * src2[i0].element_[2] + src1.element_[3][3] * src2[i0].element_[3];
    }
}

This resulted in the following assembly code from the Intel C++ Compiler:

00007FF69F123D50  sub         rsp,38h  
00007FF69F123D54  vmovaps     xmmword ptr [rsp+20h],xmm8  
00007FF69F123D5A  vmovaps     xmmword ptr [rsp+10h],xmm7  
00007FF69F123D60  vmovaps     xmmword ptr [rsp],xmm6  
        destination[i0].element_[0] = src1.element_[0][0] * src2[i0].element_[0] + src1.element_[0][1] * src2[i0].element_[1] + src1.element_[0][2] * src2[i0].element_[2] + src1.element_[0][3] * src2[i0].element_[3];
00007FF69F123D65  vmovss      xmm0,dword ptr [rdx]  
00007FF69F123D69  vmovss      xmm1,dword ptr [rdx+4]  
00007FF69F123D6E  vmovss      xmm2,dword ptr [rdx+8]  
00007FF69F123D73  vmovss      xmm3,dword ptr [rdx+0Ch]  
        destination[i0].element_[1] = src1.element_[1][0] * src2[i0].element_[0] + src1.element_[1][1] * src2[i0].element_[1] + src1.element_[1][2] * src2[i0].element_[2] + src1.element_[1][3] * src2[i0].element_[3];
00007FF69F123D78  vmovss      xmm4,dword ptr [rdx+10h]  
00007FF69F123D7D  vmovss      xmm5,dword ptr [rdx+14h]  
00007FF69F123D82  vmovss      xmm16,dword ptr [rdx+18h]  
00007FF69F123D89  vmovss      xmm17,dword ptr [rdx+1Ch]  
        destination[i0].element_[2] = src1.element_[2][0] * src2[i0].element_[0] + src1.element_[2][1] * src2[i0].element_[1] + src1.element_[2][2] * src2[i0].element_[2] + src1.element_[2][3] * src2[i0].element_[3];
00007FF69F123D90  vmovss      xmm18,dword ptr [rdx+20h]  
00007FF69F123D97  vmovss      xmm19,dword ptr [rdx+24h]  
00007FF69F123D9E  vmovss      xmm20,dword ptr [rdx+28h]  
00007FF69F123DA5  vmovss      xmm21,dword ptr [rdx+2Ch]  
        destination[i0].element_[3] = src1.element_[3][0] * src2[i0].element_[0] + src1.element_[3][1] * src2[i0].element_[1] + +src1.element_[3][2] * src2[i0].element_[2] + src1.element_[3][3] * src2[i0].element_[3];
00007FF69F123DAC  vmovss      xmm22,dword ptr [rdx+30h]  
00007FF69F123DB3  vmovss      xmm23,dword ptr [rdx+34h]  
00007FF69F123DBA  vmovss      xmm24,dword ptr [rdx+38h]  
00007FF69F123DC1  vmovss      xmm25,dword ptr [rdx+3Ch]  
    for (uint32_t i0{}; i0 < vector_count; ++i0) {
00007FF69F123DC8  lea         rax,[r8+3A9800h]  
00007FF69F123DCF  cmp         rax,rcx  
00007FF69F123DD2  jbe         mat4_mul_vec4_cpp+150h (07FF69F123EA0h)  
00007FF69F123DD8  lea         rax,[rcx+3A9800h]  
00007FF69F123DDF  cmp         rax,r8  
00007FF69F123DE2  jbe         mat4_mul_vec4_cpp+150h (07FF69F123EA0h)  
00007FF69F123DE8  mov         eax,0Ch  
00007FF69F123DED  nop         dword ptr [rax]  
        destination[i0].element_[0] = src1.element_[0][0] * src2[i0].element_[0] + src1.element_[0][1] * src2[i0].element_[1] + src1.element_[0][2] * src2[i0].element_[2] + src1.element_[0][3] * src2[i0].element_[3];
00007FF69F123DF0  vmulss      xmm26,xmm0,dword ptr [r8+rax-0Ch]  
00007FF69F123DF8  vfmadd231ss xmm26,xmm1,dword ptr [r8+rax-8]  
00007FF69F123E00  vfmadd231ss xmm26,xmm2,dword ptr [r8+rax-4]  
00007FF69F123E08  vfmadd231ss xmm26,xmm3,dword ptr [r8+rax]  
00007FF69F123E0F  vmovss      dword ptr [rcx+rax-0Ch],xmm26  
        destination[i0].element_[1] = src1.element_[1][0] * src2[i0].element_[0] + src1.element_[1][1] * src2[i0].element_[1] + src1.element_[1][2] * src2[i0].element_[2] + src1.element_[1][3] * src2[i0].element_[3];
00007FF69F123E17  vmulss      xmm26,xmm4,dword ptr [r8+rax-0Ch]  
00007FF69F123E1F  vfmadd231ss xmm26,xmm5,dword ptr [r8+rax-8]  
00007FF69F123E27  vfmadd231ss xmm26,xmm16,dword ptr [r8+rax-4]  
00007FF69F123E2F  vfmadd231ss xmm26,xmm17,dword ptr [r8+rax]  
00007FF69F123E36  vmovss      dword ptr [rcx+rax-8],xmm26  
        destination[i0].element_[2] = src1.element_[2][0] * src2[i0].element_[0] + src1.element_[2][1] * src2[i0].element_[1] + src1.element_[2][2] * src2[i0].element_[2] + src1.element_[2][3] * src2[i0].element_[3];
00007FF69F123E3E  vmulss      xmm26,xmm18,dword ptr [r8+rax-0Ch]  
00007FF69F123E46  vfmadd231ss xmm26,xmm19,dword ptr [r8+rax-8]  
00007FF69F123E4E  vfmadd231ss xmm26,xmm20,dword ptr [r8+rax-4]  
00007FF69F123E56  vfmadd231ss xmm26,xmm21,dword ptr [r8+rax]  
00007FF69F123E5D  vmovss      dword ptr [rcx+rax-4],xmm26  
        destination[i0].element_[3] = src1.element_[3][0] * src2[i0].element_[0] + src1.element_[3][1] * src2[i0].element_[1] + +src1.element_[3][2] * src2[i0].element_[2] + src1.element_[3][3] * src2[i0].element_[3];
00007FF69F123E65  vmulss      xmm26,xmm22,dword ptr [r8+rax-0Ch]  
00007FF69F123E6D  vfmadd231ss xmm26,xmm23,dword ptr [r8+rax-8]  
00007FF69F123E75  vfmadd231ss xmm26,xmm24,dword ptr [r8+rax-4]  
00007FF69F123E7D  vfmadd231ss xmm26,xmm25,dword ptr [r8+rax]  
00007FF69F123E84  vmovss      dword ptr [rcx+rax],xmm26  
    for (uint32_t i0{}; i0 < vector_count; ++i0) {
00007FF69F123E8B  add         rax,10h  
00007FF69F123E8F  cmp         rax,3A980Ch  
00007FF69F123E95  jne         mat4_mul_vec4_cpp+0A0h (07FF69F123DF0h)  
00007FF69F123E9B  jmp         mat4_mul_vec4_cpp+2FEh (07FF69F12404Eh)  
00007FF69F123EA0  vbroadcastss ymm0,xmm0  
00007FF69F123EA5  vbroadcastss ymm1,xmm1  
00007FF69F123EAA  vbroadcastss ymm2,xmm2  
00007FF69F123EAF  vbroadcastss ymm3,xmm3  
00007FF69F123EB4  vbroadcastss ymm4,xmm4  
00007FF69F123EB9  vbroadcastss ymm5,xmm5  
00007FF69F123EBE  vbroadcastss ymm16,xmm16  
00007FF69F123EC4  vbroadcastss ymm17,xmm17  
00007FF69F123ECA  vbroadcastss ymm18,xmm18  
00007FF69F123ED0  vbroadcastss ymm19,xmm19  
00007FF69F123ED6  vbroadcastss ymm20,xmm20  
00007FF69F123EDC  vbroadcastss ymm21,xmm21  
00007FF69F123EE2  vbroadcastss ymm22,xmm22  
00007FF69F123EE8  vbroadcastss ymm23,xmm23  
00007FF69F123EEE  vbroadcastss ymm24,xmm24  
00007FF69F123EF4  vbroadcastss ymm25,xmm25  
00007FF69F123EFA  xor         eax,eax  
        destination[i0].element_[0] = src1.element_[0][0] * src2[i0].element_[0] + src1.element_[0][1] * src2[i0].element_[1] + src1.element_[0][2] * src2[i0].element_[2] + src1.element_[0][3] * src2[i0].element_[3];
00007FF69F123EFC  vmovups     xmm26,xmmword ptr [r8+rax]  
00007FF69F123F03  vmovups     xmm27,xmmword ptr [r8+rax+10h]  
00007FF69F123F0B  vmovups     xmm28,xmmword ptr [r8+rax+20h]  
00007FF69F123F13  vmovups     xmm29,xmmword ptr [r8+rax+30h]  
00007FF69F123F1B  vinsertf32x4 ymm26,ymm26,xmmword ptr [r8+rax+40h],1  
00007FF69F123F24  vinsertf32x4 ymm27,ymm27,xmmword ptr [r8+rax+50h],1  
00007FF69F123F2D  vinsertf32x4 ymm28,ymm28,xmmword ptr [r8+rax+60h],1  
00007FF69F123F36  vinsertf32x4 ymm29,ymm29,xmmword ptr [r8+rax+70h],1  
00007FF69F123F3F  vshufps     ymm30,ymm26,ymm27,14h  
00007FF69F123F46  vshufps     ymm31,ymm29,ymm28,41h  
00007FF69F123F4D  vshufps     ymm6,ymm30,ymm31,6Ch  
00007FF69F123F54  vmulps      ymm7,ymm6,ymm0  
        destination[i0].element_[1] = src1.element_[1][0] * src2[i0].element_[0] + src1.element_[1][1] * src2[i0].element_[1] + src1.element_[1][2] * src2[i0].element_[2] + src1.element_[1][3] * src2[i0].element_[3];
00007FF69F123F58  vmulps      ymm8,ymm6,ymm4  
        destination[i0].element_[0] = src1.element_[0][0] * src2[i0].element_[0] + src1.element_[0][1] * src2[i0].element_[1] + src1.element_[0][2] * src2[i0].element_[2] + src1.element_[0][3] * src2[i0].element_[3];
00007FF69F123F5C  vshufps     ymm30,ymm30,ymm31,39h  
        destination[i0].element_[2] = src1.element_[2][0] * src2[i0].element_[0] + src1.element_[2][1] * src2[i0].element_[1] + src1.element_[2][2] * src2[i0].element_[2] + src1.element_[2][3] * src2[i0].element_[3];
00007FF69F123F63  vmulps      ymm31,ymm6,ymm18  
        destination[i0].element_[3] = src1.element_[3][0] * src2[i0].element_[0] + src1.element_[3][1] * src2[i0].element_[1] + +src1.element_[3][2] * src2[i0].element_[2] + src1.element_[3][3] * src2[i0].element_[3];
00007FF69F123F69  vmulps      ymm6,ymm6,ymm22  
        destination[i0].element_[0] = src1.element_[0][0] * src2[i0].element_[0] + src1.element_[0][1] * src2[i0].element_[1] + src1.element_[0][2] * src2[i0].element_[2] + src1.element_[0][3] * src2[i0].element_[3];
00007FF69F123F6F  vfmadd231ps ymm7,ymm30,ymm1  
        destination[i0].element_[1] = src1.element_[1][0] * src2[i0].element_[0] + src1.element_[1][1] * src2[i0].element_[1] + src1.element_[1][2] * src2[i0].element_[2] + src1.element_[1][3] * src2[i0].element_[3];
00007FF69F123F75  vfmadd231ps ymm8,ymm30,ymm5  
        destination[i0].element_[2] = src1.element_[2][0] * src2[i0].element_[0] + src1.element_[2][1] * src2[i0].element_[1] + src1.element_[2][2] * src2[i0].element_[2] + src1.element_[2][3] * src2[i0].element_[3];
00007FF69F123F7B  vfmadd231ps ymm31,ymm30,ymm19  
        destination[i0].element_[3] = src1.element_[3][0] * src2[i0].element_[0] + src1.element_[3][1] * src2[i0].element_[1] + +src1.element_[3][2] * src2[i0].element_[2] + src1.element_[3][3] * src2[i0].element_[3];
00007FF69F123F81  vfmadd231ps ymm6,ymm23,ymm30  
        destination[i0].element_[0] = src1.element_[0][0] * src2[i0].element_[0] + src1.element_[0][1] * src2[i0].element_[1] + src1.element_[0][2] * src2[i0].element_[2] + src1.element_[0][3] * src2[i0].element_[3];
00007FF69F123F87  vshufps     ymm26,ymm26,ymm27,0BEh  
00007FF69F123F8E  vshufps     ymm27,ymm29,ymm28,0EBh  
00007FF69F123F95  vshufps     ymm28,ymm26,ymm27,6Ch  
00007FF69F123F9C  vfmadd231ps ymm7,ymm28,ymm2  
        destination[i0].element_[1] = src1.element_[1][0] * src2[i0].element_[0] + src1.element_[1][1] * src2[i0].element_[1] + src1.element_[1][2] * src2[i0].element_[2] + src1.element_[1][3] * src2[i0].element_[3];
00007FF69F123FA2  vfmadd231ps ymm8,ymm28,ymm16  
        destination[i0].element_[2] = src1.element_[2][0] * src2[i0].element_[0] + src1.element_[2][1] * src2[i0].element_[1] + src1.element_[2][2] * src2[i0].element_[2] + src1.element_[2][3] * src2[i0].element_[3];
00007FF69F123FA8  vfmadd231ps ymm31,ymm28,ymm20  
        destination[i0].element_[3] = src1.element_[3][0] * src2[i0].element_[0] + src1.element_[3][1] * src2[i0].element_[1] + +src1.element_[3][2] * src2[i0].element_[2] + src1.element_[3][3] * src2[i0].element_[3];
00007FF69F123FAE  vfmadd231ps ymm6,ymm24,ymm28  
        destination[i0].element_[0] = src1.element_[0][0] * src2[i0].element_[0] + src1.element_[0][1] * src2[i0].element_[1] + src1.element_[0][2] * src2[i0].element_[2] + src1.element_[0][3] * src2[i0].element_[3];
00007FF69F123FB4  vshufps     ymm26,ymm26,ymm27,39h  
00007FF69F123FBB  vfmadd231ps ymm7,ymm26,ymm3  
        destination[i0].element_[1] = src1.element_[1][0] * src2[i0].element_[0] + src1.element_[1][1] * src2[i0].element_[1] + src1.element_[1][2] * src2[i0].element_[2] + src1.element_[1][3] * src2[i0].element_[3];
00007FF69F123FC1  vfmadd231ps ymm8,ymm26,ymm17  
        destination[i0].element_[2] = src1.element_[2][0] * src2[i0].element_[0] + src1.element_[2][1] * src2[i0].element_[1] + src1.element_[2][2] * src2[i0].element_[2] + src1.element_[2][3] * src2[i0].element_[3];
00007FF69F123FC7  vfmadd231ps ymm31,ymm26,ymm21  
        destination[i0].element_[3] = src1.element_[3][0] * src2[i0].element_[0] + src1.element_[3][1] * src2[i0].element_[1] + +src1.element_[3][2] * src2[i0].element_[2] + src1.element_[3][3] * src2[i0].element_[3];
00007FF69F123FCD  vfmadd231ps ymm6,ymm25,ymm26  
00007FF69F123FD3  vpunpckldq  ymm26,ymm7,ymm8  
00007FF69F123FD9  vpunpckldq  ymm27,ymm31,ymm6  
00007FF69F123FDF  vpunpckhdq  ymm28,ymm7,ymm8  
00007FF69F123FE5  vpunpckhdq  ymm29,ymm31,ymm6  
00007FF69F123FEB  vpunpcklqdq ymm30,ymm26,ymm27  
00007FF69F123FF1  vpunpckhqdq ymm26,ymm26,ymm27  
00007FF69F123FF7  vpunpcklqdq ymm27,ymm28,ymm29  
00007FF69F123FFD  vpunpckhqdq ymm28,ymm28,ymm29  
00007FF69F124003  vinsertf32x4 ymm29,ymm30,xmm26,1  
00007FF69F12400A  vmovups     ymmword ptr [rcx+rax],ymm29  
00007FF69F124011  vinsertf32x4 ymm29,ymm27,xmm28,1  
00007FF69F124018  vmovups     ymmword ptr [rcx+rax+20h],ymm29  
00007FF69F124020  vshuff64x2  ymm26,ymm30,ymm26,3  
00007FF69F124027  vmovupd     ymmword ptr [rcx+rax+40h],ymm26  
00007FF69F12402F  vshuff64x2  ymm26,ymm27,ymm28,3  
00007FF69F124036  vmovupd     ymmword ptr [rcx+rax+60h],ymm26  
    for (uint32_t i0{}; i0 < vector_count; ++i0) {
00007FF69F12403E  sub         rax,0FFFFFFFFFFFFFF80h  
00007FF69F124042  cmp         rax,3A9800h  
00007FF69F124048  jne         mat4_mul_vec4_cpp+1ACh (07FF69F123EFCh)  
    }
}
00007FF69F12404E  vmovaps     xmm6,xmmword ptr [rsp]  
00007FF69F124053  vmovaps     xmm7,xmmword ptr [rsp+10h]  
00007FF69F124059  vmovaps     xmm8,xmmword ptr [rsp+20h]  
00007FF69F12405F  add         rsp,38h  
00007FF69F124063  vzeroupper  
00007FF69F124066  ret  

This is the calling code for the benchmark, when using the same source and destination vector array:

int main() {
    SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);

    SetThreadPriorityBoost(GetCurrentThread(), false);
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);

    mat4f_t matrix{ {0, 1, 2, 3,
                    4, 5, 6, 7,
                    8, 9, 10, 11,
                    12, 13, 14, 15} };

    vec4f_t* dst_vector = new vec4f_t[1'000'000]{};
    vec4f_t* src_vector = new vec4f_t[1'000'000]{};

    vec3f_t* dst_vector0 = new vec3f_t[1'000'000]{};
    vec3f_t* src_vector0 = new vec3f_t[1'000'000]{};

    for (uint32_t i0{}; i0 < 1000000; ++i0) {
        src_vector[i0] = vec4f_t{ (float)i0, (float)i0, (float)i0, (float)i0 };
        src_vector0[i0] = vec3f_t{ (float)i0, (float)i0, (float)i0 };
    }

    set_mxcsr0(0b1111'1111'1100'0000);

    for (uint64_t i0{}; i0 < 30000; ++i0) {
        for (uint32_t i1{}; i1 < 16; ++i1) {
            reinterpret_cast<float*>(matrix.element_)[i1] = 1. / i0;
        }
        mat4_mul_vec4_avx512(src_vector, matrix, src_vector, 240000);
        mat4_mul_vec4_cpp(src_vector, matrix, src_vector, 240000);
        mat4_mul_vec3_avx512(src_vector0, matrix, src_vector0, 240000);
        mat4_mul_vec3_cpp(src_vector0, matrix, src_vector0, 240000);
        fps();
    }

    for (uint32_t i0{}; i0 < 1000000; ++i0) {
        std::cout << src_vector0[i0] << std::endl;
        std::cout << src_vector[i0] << std::endl;
    }
}

When testing with different source and destination arrays, src_vector/src_vector0 as the first argument are replaced by dst_vector/dst_vector0.
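
That is, the four calls inside the timing loop would become:

    mat4_mul_vec4_avx512(dst_vector, matrix, src_vector, 240000);
    mat4_mul_vec4_cpp(dst_vector, matrix, src_vector, 240000);
    mat4_mul_vec3_avx512(dst_vector0, matrix, src_vector0, 240000);
    mat4_mul_vec3_cpp(dst_vector0, matrix, src_vector0, 240000);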

These are the benchmark results when using the same source and destination arrays:

[Screenshot: same source and destination arrays; the assembly variants perform much better]

These are the benchmark results when using different source and destination arrays:

[Screenshot: different source and destination arrays; the assembly variants perform nearly the same as the C++ versions]

The benchmarks were run on a Windows 11 machine with an 11th Gen Core i7-11850H (Tiger Lake) CPU, using the Intel C++ Compiler 2024 and, as mentioned, the Intel VTune Profiler.

There are a few things which I don't understand:

  1. Why does the speed vary when using different/the same source and destination arrays? Does it have anything to do with the cache in this case?

  2. Why does the C++ counterpart get so close to the assembly variant's performance when using different source and destination arrays?

Thank you for your help. I really appreciate that.


Solution

  • Your first VTune screenshot (the source=dest version) shows the cpp version running 100% scalar ops (far right column), vs. the separate dest case running 100% packed SIMD FP ops.

    It's common for auto-vectorization to check for overlap and fall back to a scalar loop if an output overlaps any input, not making a third version that's specialized for dst=src (e.g. perhaps loading everything first, if that still matches the C++ semantics for that case). So it's probably using the scalar fallback because its checks detected overlap. (In code that always has disjoint dst and src, you'd use __restrict to promise that and optimize away those checks. It might happen to compile to asm that works in the perfect overlap dst=src case, but that's still UB so don't do it, it could break in future or with different surrounding code to inline into.)

    If your source special-cases the dst=src case, that could get auto-vectorized.
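
    If the arrays really are always disjoint, __restrict on the pointer parameters is enough for the compiler to drop the runtime overlap checks and the scalar fallback. A minimal sketch under that assumption, using the vec4f_t / mat4f_t types from the question (calling it with destination == src2 would be UB):

    void mat4_mul_vec4_cpp_restrict(vec4f_t* __restrict destination, const mat4f_t& src1,
                                    const vec4f_t* __restrict src2, uint32_t vector_count) {
        // __restrict promises the compiler that destination never aliases src2,
        // so it can auto-vectorize without an overlap check / scalar fallback path.
        for (uint32_t i0{}; i0 < vector_count; ++i0) {
            for (uint32_t row{}; row < 4; ++row) {
                destination[i0].element_[row] =
                    src1.element_[row][0] * src2[i0].element_[0] +
                    src1.element_[row][1] * src2[i0].element_[1] +
                    src1.element_[row][2] * src2[i0].element_[2] +
                    src1.element_[row][3] * src2[i0].element_[3];
            }
        }
    }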


    also according to the static analysis tools uica and llvm-mca multiple times faster. uica.uops.info

    It's not that extreme; uiCA predicts a 1.75x speedup for your asm vs. LLVM's on Tiger Lake.

    • uiCA for the C++ auto-vectorized inner loop - predicts 14.14 cycles per iteration (128 bytes of data = 8x float4) using YMM vectors. With perfect scheduling the back-end port-throughput bottleneck would be 13.33 cycles, but apparently it predicts that too many of the shuffles get scheduled to port 1 and steal cycles from the MULs/FMAs.
      54 front-end uops / iteration = 6.75 per float4
    • uiCA for the asm loop - predicts 4.02 cycles per iteration (64 bytes = 4x float4). As expected, if loads/stores aren't a bottleneck, out-of-order exec has no problem overlapping work across iterations and saturating port 0 with FMAs, port 5 with shuffles. The indexed addressing-mode isn't a problem since you don't use it as a memory operand for an ALU instruction, just for pure-load / pure-store.
      12 total (fused-domain) uops per iter = 3 per float4, so more hyperthreading-friendly if the other logical core has some scalar integer work that can run on ports 1 and 6, or while this thread stalls on load/store bandwidth.

    So in the best no-stall case, (14.14/2) / 4.02 = 1.756 is the speed ratio in cycles per float4. Maybe you missed the work-per-iteration factor (look at the pointer increments and addressing-mode offsets), or you copy/pasted the whole function instead of just the inner loop? uiCA treats the asm as a loop body whether it ends with a branch or not, assuming any earlier branches are not-taken.
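
    Normalizing both uiCA predictions to cycles per float4 makes the work-per-iteration factor explicit:

        C++ loop:  14.14 cycles / 8 float4 per iteration  ≈ 1.77 cycles per float4
        asm loop:   4.02 cycles / 4 float4 per iteration  ≈ 1.01 cycles per float4
        ratio:      1.77 / 1.01                           ≈ 1.76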

    Memory / cache bandwidth bottlenecks might be the limiting factor, leaving both running at about the same speed? For equivalent code (not like this with tons of extra shuffles with 256-bit vectors), unaligned data can be maybe 15% worse DRAM bandwidth on Skylake Xeon with 512-bit vectors than with 256-bit vectors (where only every other load/store is a cache-line split.)

    Source = destination is better for cache footprint since you're only touching half as much data. It's also better for memory bandwidth: without NT stores, a write costs a read+write if the data isn't already hot in cache, including a Read For Ownership to get the old value of the cache line and exclusive ownership. (I'm not sure if there are any optimizations for an aligned store of a whole 64-byte cache line, like whether that can just invalidate other copies without doing an RFO but without bypassing / evicting from cache like NT stores, instead like how rep movsb / rep stosb microcode works.) Anyway, if DRAM bandwidth were the bottleneck, you'd expect in-place to be 1.5x as fast as the copy version, modulo other factors like mixed read+write having some overhead for the DRAM itself.
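
    A rough traffic count per 64 bytes of output, assuming nothing is hot in cache and no NT stores, shows where that 1.5x comes from:

        separate dst:  read 64B of src  +  RFO read 64B of dst  +  write back 64B of dst  =  192B of DRAM traffic
        in-place:      read 64B of src/dst  +  write back the same 64B                    =  128B of DRAM traffic

    192B / 128B = 1.5x more memory traffic for the copy version.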

    So the in-place case might be unshackling your asm from bandwidth bottlenecks, as well as causing the C++ to face-plant as it runs scalar code, magnifying the difference.


    It looks like your code uses the same matrix every time, but with different vectors. And you load 4 float4 vectors as one ZMM vector.

    You software-pipeline the load+shuffle ahead of the MUL+FMAs, which is really not necessary: all CPUs with AVX-512 have pretty deep out-of-order execution buffers to find instruction-level parallelism across iterations, and this is a fairly short dependency chain.

    Also, you load + shuffle a final vector but don't FMA + store it. You could peel a final set of MUL/FMA + store out of the end of the last iteration, and have your loop counter do one fewer iteration.

    For Zen 4, you might use cmp/jne on the pointer instead of sub/jnz on a separate counter. Zen can macro-fuse cmp/jcc but not sub/jcc.


    It looks like ICX unrolled to do 128 bytes of data per iteration, but it is doing a lot of shuffle uops per input vector. One vpermps (or vpermt2ps if it needs to pull data from 2x YMM instead of 1x ZMM with the default -mprefer-vector-width=256) per multiply should be better, but it's not inventing shuffle constants other than immediates for vshufps.

    It looks like it broadcasted each scalar element of the matrix to a separate YMM register, with 16x vbroadcastss instructions. So this requires shuffles to line up the results.


    Even with "client" chips having 1/clock ZMM FMA/mul/add but 2/clock YMM, I'd have expected your asm to be faster due to both front-end and back-end execution-port throughput bottlenecks from all those shuffles. Even if your data isn't aligned by 64, so every load and store is a cache-line split. (512-bit vectors make aligned data more important than with 256-bit.) Even with possible clock-speed penalties from using 512-bit vectors.

    You probably didn't need to write this in asm. You could probably get about the same asm with intrinsics like _mm512_loadu_ps and _mm512_fmadd_ps. asm{} does let you align 32 for that one specific loop to help out the uop cache, but Tiger Lake has a working LSD (loop buffer). (Skylake-X only has its uop cache, no LSD, and it has the JCC erratum creating even more code-alignment potholes for the bottom of loops.)
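
    For reference, here's a minimal intrinsics sketch of the same approach, assuming the transposed-matrix convention of the asm version, a vector_count that's a multiple of 4, and the vec4f_t / mat4f_t types and mat4_mul_vec4_avx512_vpermps_index table from the question (untested, just to show the shape):

    #include <immintrin.h>

    void mat4_mul_vec4_avx512_intrin(vec4f_t* destination, const mat4f_t& src1,
                                     const vec4f_t* src2, uint32_t vector_count) {
        // Broadcast each 16-byte matrix row to all four 128-bit lanes of a ZMM.
        const __m512 row0 = _mm512_broadcast_f32x4(_mm_loadu_ps(&src1.element_[0][0]));
        const __m512 row1 = _mm512_broadcast_f32x4(_mm_loadu_ps(&src1.element_[1][0]));
        const __m512 row2 = _mm512_broadcast_f32x4(_mm_loadu_ps(&src1.element_[2][0]));
        const __m512 row3 = _mm512_broadcast_f32x4(_mm_loadu_ps(&src1.element_[3][0]));

        // Same permute patterns as the vpermps index table: replicate element k
        // of each input vec4 across its own 128-bit lane.
        const __m512i idx0 = _mm512_load_epi32(&mat4_mul_vec4_avx512_vpermps_index[0]);
        const __m512i idx1 = _mm512_load_epi32(&mat4_mul_vec4_avx512_vpermps_index[16]);
        const __m512i idx2 = _mm512_load_epi32(&mat4_mul_vec4_avx512_vpermps_index[32]);
        const __m512i idx3 = _mm512_load_epi32(&mat4_mul_vec4_avx512_vpermps_index[48]);

        for (uint32_t i = 0; i < vector_count; i += 4) {   // 4 vec4s per ZMM
            const __m512 v = _mm512_loadu_ps(&src2[i]);
            __m512 acc = _mm512_mul_ps(row0, _mm512_permutexvar_ps(idx0, v));
            acc = _mm512_fmadd_ps(row1, _mm512_permutexvar_ps(idx1, v), acc);
            acc = _mm512_fmadd_ps(row2, _mm512_permutexvar_ps(idx2, v), acc);
            acc = _mm512_fmadd_ps(row3, _mm512_permutexvar_ps(idx3, v), acc);
            _mm512_storeu_ps(&destination[i], acc);
        }
    }

    The out-of-order machinery should find the same cross-iteration parallelism your hand-pipelined asm exposes explicitly, so the compiler's scheduling of this is unlikely to matter much.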