I've written a few functions that carry out matrix-vector multiplication using a single matrix together with an array of source vectors. I wrote each of them once in C++ and once in x86 AVX-512 Assembly and compared the performance with the Intel VTune Profiler. When the source vector array is also used as the destination array, the Assembly variants perform 3.5x to 10x faster than their C++ counterparts, but when using different source and destination arrays, the Assembly variants barely outperform the C++ counterparts, achieving almost the same performance... sometimes even slightly worse.
Another thing I can't understand is why the C++ counterparts can even reach the same or better performance than the Assembly variants when using different source and destination arrays, even though the Assembly code is much shorter and, according to the static analysis tools uiCA (uica.uops.info) and llvm-mca, multiple times faster.
I don't want to make this post longer than it already is, so I'm only posting the code for the mat4-vec4 multiplication functions.
This is the code for the Assembly variant, which assumes the matrix is transposed:
alignas(64) uint32_t mat4_mul_vec4_avx512_vpermps_index[64]{ 0, 0, 0, 0, 4, 4, 4, 4, 8, 8, 8, 8, 12, 12, 12, 12,
1, 1, 1, 1, 5, 5, 5, 5, 9, 9, 9, 9, 13, 13, 13, 13,
2, 2, 2, 2, 6, 6, 6, 6, 10, 10, 10, 10, 14, 14, 14, 14,
3, 3, 3, 3, 7, 7, 7, 7, 11, 11, 11, 11, 15, 15, 15, 15 };
void __declspec(naked, align(64)) mat4_mul_vec4_avx512(vec4f_t* destination, const mat4f_t& src1, const vec4f_t* src2, uint32_t vector_count) {
    // Windows x64 ABI: rcx = destination, rdx = &src1 (transposed matrix), r8 = src2, r9 = vector_count
    __asm {
        vbroadcastf32x4 zmm16, xmmword ptr[rdx]        ; broadcast matrix row 0 to all four 128-bit lanes
        vbroadcastf32x4 zmm17, xmmword ptr[rdx + 16]   ; row 1
        vbroadcastf32x4 zmm18, xmmword ptr[rdx + 32]   ; row 2
        vbroadcastf32x4 zmm19, xmmword ptr[rdx + 48]   ; row 3
        vmovups zmm20, zmmword ptr[mat4_mul_vec4_avx512_vpermps_index]         ; permute indices: splat x
        vmovups zmm21, zmmword ptr[mat4_mul_vec4_avx512_vpermps_index + 64]    ; splat y
        vmovups zmm22, zmmword ptr[mat4_mul_vec4_avx512_vpermps_index + 128]   ; splat z
        vmovups zmm23, zmmword ptr[mat4_mul_vec4_avx512_vpermps_index + 192]   ; splat w
        vmovups zmm24, zmmword ptr[r8]                 ; load the first 4 source vectors (64 bytes)
        vpermps zmm25, zmm20, zmm24
        vpermps zmm26, zmm21, zmm24
        vpermps zmm27, zmm22, zmm24
        vpermps zmm28, zmm23, zmm24
        xor eax, eax
        align 32
    mat4_mul_vec4_avx512_loop:
        vmovups zmm24, zmmword ptr[r8+rax+64]          ; load the next 4 source vectors
        vmulps zmm29, zmm16, zmm25
        vpermps zmm25, zmm20, zmm24
        vfmadd231ps zmm29, zmm17, zmm26
        vpermps zmm26, zmm21, zmm24
        vfmadd231ps zmm29, zmm18, zmm27
        vpermps zmm27, zmm22, zmm24
        vfmadd231ps zmm29, zmm19, zmm28
        vpermps zmm28, zmm23, zmm24
        vmovups zmmword ptr[rcx+rax], zmm29            ; store 4 result vectors
        add rax, 64
        sub r9, 4                                      ; 4 vectors processed per iteration
        jnz mat4_mul_vec4_avx512_loop
        ret
    }
}
This is the C++ variant, which assumes the matrix is not transposed:
void mat4_mul_vec4_cpp(vec4f_t* destination, const mat4f_t& src1, const vec4f_t* src2, uint32_t vector_count) {
    for (uint32_t i0{}; i0 < vector_count; ++i0) {
        destination[i0].element_[0] = src1.element_[0][0] * src2[i0].element_[0] + src1.element_[0][1] * src2[i0].element_[1] + src1.element_[0][2] * src2[i0].element_[2] + src1.element_[0][3] * src2[i0].element_[3];
        destination[i0].element_[1] = src1.element_[1][0] * src2[i0].element_[0] + src1.element_[1][1] * src2[i0].element_[1] + src1.element_[1][2] * src2[i0].element_[2] + src1.element_[1][3] * src2[i0].element_[3];
        destination[i0].element_[2] = src1.element_[2][0] * src2[i0].element_[0] + src1.element_[2][1] * src2[i0].element_[1] + src1.element_[2][2] * src2[i0].element_[2] + src1.element_[2][3] * src2[i0].element_[3];
        destination[i0].element_[3] = src1.element_[3][0] * src2[i0].element_[0] + src1.element_[3][1] * src2[i0].element_[1] + src1.element_[3][2] * src2[i0].element_[2] + src1.element_[3][3] * src2[i0].element_[3];
    }
}
The Intel C++ Compiler generated the following Assembly from it:
00007FF69F123D50 sub rsp,38h
00007FF69F123D54 vmovaps xmmword ptr [rsp+20h],xmm8
00007FF69F123D5A vmovaps xmmword ptr [rsp+10h],xmm7
00007FF69F123D60 vmovaps xmmword ptr [rsp],xmm6
destination[i0].element_[0] = src1.element_[0][0] * src2[i0].element_[0] + src1.element_[0][1] * src2[i0].element_[1] + src1.element_[0][2] * src2[i0].element_[2] + src1.element_[0][3] * src2[i0].element_[3];
00007FF69F123D65 vmovss xmm0,dword ptr [rdx]
00007FF69F123D69 vmovss xmm1,dword ptr [rdx+4]
00007FF69F123D6E vmovss xmm2,dword ptr [rdx+8]
00007FF69F123D73 vmovss xmm3,dword ptr [rdx+0Ch]
destination[i0].element_[1] = src1.element_[1][0] * src2[i0].element_[0] + src1.element_[1][1] * src2[i0].element_[1] + src1.element_[1][2] * src2[i0].element_[2] + src1.element_[1][3] * src2[i0].element_[3];
00007FF69F123D78 vmovss xmm4,dword ptr [rdx+10h]
00007FF69F123D7D vmovss xmm5,dword ptr [rdx+14h]
00007FF69F123D82 vmovss xmm16,dword ptr [rdx+18h]
00007FF69F123D89 vmovss xmm17,dword ptr [rdx+1Ch]
destination[i0].element_[2] = src1.element_[2][0] * src2[i0].element_[0] + src1.element_[2][1] * src2[i0].element_[1] + src1.element_[2][2] * src2[i0].element_[2] + src1.element_[2][3] * src2[i0].element_[3];
00007FF69F123D90 vmovss xmm18,dword ptr [rdx+20h]
00007FF69F123D97 vmovss xmm19,dword ptr [rdx+24h]
00007FF69F123D9E vmovss xmm20,dword ptr [rdx+28h]
00007FF69F123DA5 vmovss xmm21,dword ptr [rdx+2Ch]
destination[i0].element_[3] = src1.element_[3][0] * src2[i0].element_[0] + src1.element_[3][1] * src2[i0].element_[1] + +src1.element_[3][2] * src2[i0].element_[2] + src1.element_[3][3] * src2[i0].element_[3];
00007FF69F123DAC vmovss xmm22,dword ptr [rdx+30h]
00007FF69F123DB3 vmovss xmm23,dword ptr [rdx+34h]
00007FF69F123DBA vmovss xmm24,dword ptr [rdx+38h]
00007FF69F123DC1 vmovss xmm25,dword ptr [rdx+3Ch]
for (uint32_t i0{}; i0 < vector_count; ++i0) {
00007FF69F123DC8 lea rax,[r8+3A9800h]
00007FF69F123DCF cmp rax,rcx
00007FF69F123DD2 jbe mat4_mul_vec4_cpp+150h (07FF69F123EA0h)
00007FF69F123DD8 lea rax,[rcx+3A9800h]
00007FF69F123DDF cmp rax,r8
00007FF69F123DE2 jbe mat4_mul_vec4_cpp+150h (07FF69F123EA0h)
00007FF69F123DE8 mov eax,0Ch
00007FF69F123DED nop dword ptr [rax]
destination[i0].element_[0] = src1.element_[0][0] * src2[i0].element_[0] + src1.element_[0][1] * src2[i0].element_[1] + src1.element_[0][2] * src2[i0].element_[2] + src1.element_[0][3] * src2[i0].element_[3];
00007FF69F123DF0 vmulss xmm26,xmm0,dword ptr [r8+rax-0Ch]
00007FF69F123DF8 vfmadd231ss xmm26,xmm1,dword ptr [r8+rax-8]
00007FF69F123E00 vfmadd231ss xmm26,xmm2,dword ptr [r8+rax-4]
00007FF69F123E08 vfmadd231ss xmm26,xmm3,dword ptr [r8+rax]
00007FF69F123E0F vmovss dword ptr [rcx+rax-0Ch],xmm26
destination[i0].element_[1] = src1.element_[1][0] * src2[i0].element_[0] + src1.element_[1][1] * src2[i0].element_[1] + src1.element_[1][2] * src2[i0].element_[2] + src1.element_[1][3] * src2[i0].element_[3];
00007FF69F123E17 vmulss xmm26,xmm4,dword ptr [r8+rax-0Ch]
00007FF69F123E1F vfmadd231ss xmm26,xmm5,dword ptr [r8+rax-8]
00007FF69F123E27 vfmadd231ss xmm26,xmm16,dword ptr [r8+rax-4]
00007FF69F123E2F vfmadd231ss xmm26,xmm17,dword ptr [r8+rax]
00007FF69F123E36 vmovss dword ptr [rcx+rax-8],xmm26
destination[i0].element_[2] = src1.element_[2][0] * src2[i0].element_[0] + src1.element_[2][1] * src2[i0].element_[1] + src1.element_[2][2] * src2[i0].element_[2] + src1.element_[2][3] * src2[i0].element_[3];
00007FF69F123E3E vmulss xmm26,xmm18,dword ptr [r8+rax-0Ch]
00007FF69F123E46 vfmadd231ss xmm26,xmm19,dword ptr [r8+rax-8]
00007FF69F123E4E vfmadd231ss xmm26,xmm20,dword ptr [r8+rax-4]
00007FF69F123E56 vfmadd231ss xmm26,xmm21,dword ptr [r8+rax]
00007FF69F123E5D vmovss dword ptr [rcx+rax-4],xmm26
destination[i0].element_[3] = src1.element_[3][0] * src2[i0].element_[0] + src1.element_[3][1] * src2[i0].element_[1] + +src1.element_[3][2] * src2[i0].element_[2] + src1.element_[3][3] * src2[i0].element_[3];
00007FF69F123E65 vmulss xmm26,xmm22,dword ptr [r8+rax-0Ch]
00007FF69F123E6D vfmadd231ss xmm26,xmm23,dword ptr [r8+rax-8]
00007FF69F123E75 vfmadd231ss xmm26,xmm24,dword ptr [r8+rax-4]
00007FF69F123E7D vfmadd231ss xmm26,xmm25,dword ptr [r8+rax]
00007FF69F123E84 vmovss dword ptr [rcx+rax],xmm26
for (uint32_t i0{}; i0 < vector_count; ++i0) {
00007FF69F123E8B add rax,10h
00007FF69F123E8F cmp rax,3A980Ch
00007FF69F123E95 jne mat4_mul_vec4_cpp+0A0h (07FF69F123DF0h)
00007FF69F123E9B jmp mat4_mul_vec4_cpp+2FEh (07FF69F12404Eh)
00007FF69F123EA0 vbroadcastss ymm0,xmm0
00007FF69F123EA5 vbroadcastss ymm1,xmm1
00007FF69F123EAA vbroadcastss ymm2,xmm2
00007FF69F123EAF vbroadcastss ymm3,xmm3
00007FF69F123EB4 vbroadcastss ymm4,xmm4
00007FF69F123EB9 vbroadcastss ymm5,xmm5
00007FF69F123EBE vbroadcastss ymm16,xmm16
00007FF69F123EC4 vbroadcastss ymm17,xmm17
00007FF69F123ECA vbroadcastss ymm18,xmm18
00007FF69F123ED0 vbroadcastss ymm19,xmm19
00007FF69F123ED6 vbroadcastss ymm20,xmm20
00007FF69F123EDC vbroadcastss ymm21,xmm21
00007FF69F123EE2 vbroadcastss ymm22,xmm22
00007FF69F123EE8 vbroadcastss ymm23,xmm23
00007FF69F123EEE vbroadcastss ymm24,xmm24
00007FF69F123EF4 vbroadcastss ymm25,xmm25
00007FF69F123EFA xor eax,eax
destination[i0].element_[0] = src1.element_[0][0] * src2[i0].element_[0] + src1.element_[0][1] * src2[i0].element_[1] + src1.element_[0][2] * src2[i0].element_[2] + src1.element_[0][3] * src2[i0].element_[3];
00007FF69F123EFC vmovups xmm26,xmmword ptr [r8+rax]
00007FF69F123F03 vmovups xmm27,xmmword ptr [r8+rax+10h]
00007FF69F123F0B vmovups xmm28,xmmword ptr [r8+rax+20h]
00007FF69F123F13 vmovups xmm29,xmmword ptr [r8+rax+30h]
00007FF69F123F1B vinsertf32x4 ymm26,ymm26,xmmword ptr [r8+rax+40h],1
00007FF69F123F24 vinsertf32x4 ymm27,ymm27,xmmword ptr [r8+rax+50h],1
00007FF69F123F2D vinsertf32x4 ymm28,ymm28,xmmword ptr [r8+rax+60h],1
00007FF69F123F36 vinsertf32x4 ymm29,ymm29,xmmword ptr [r8+rax+70h],1
00007FF69F123F3F vshufps ymm30,ymm26,ymm27,14h
00007FF69F123F46 vshufps ymm31,ymm29,ymm28,41h
00007FF69F123F4D vshufps ymm6,ymm30,ymm31,6Ch
00007FF69F123F54 vmulps ymm7,ymm6,ymm0
destination[i0].element_[1] = src1.element_[1][0] * src2[i0].element_[0] + src1.element_[1][1] * src2[i0].element_[1] + src1.element_[1][2] * src2[i0].element_[2] + src1.element_[1][3] * src2[i0].element_[3];
00007FF69F123F58 vmulps ymm8,ymm6,ymm4
destination[i0].element_[0] = src1.element_[0][0] * src2[i0].element_[0] + src1.element_[0][1] * src2[i0].element_[1] + src1.element_[0][2] * src2[i0].element_[2] + src1.element_[0][3] * src2[i0].element_[3];
00007FF69F123F5C vshufps ymm30,ymm30,ymm31,39h
destination[i0].element_[2] = src1.element_[2][0] * src2[i0].element_[0] + src1.element_[2][1] * src2[i0].element_[1] + src1.element_[2][2] * src2[i0].element_[2] + src1.element_[2][3] * src2[i0].element_[3];
00007FF69F123F63 vmulps ymm31,ymm6,ymm18
destination[i0].element_[3] = src1.element_[3][0] * src2[i0].element_[0] + src1.element_[3][1] * src2[i0].element_[1] + +src1.element_[3][2] * src2[i0].element_[2] + src1.element_[3][3] * src2[i0].element_[3];
00007FF69F123F69 vmulps ymm6,ymm6,ymm22
destination[i0].element_[0] = src1.element_[0][0] * src2[i0].element_[0] + src1.element_[0][1] * src2[i0].element_[1] + src1.element_[0][2] * src2[i0].element_[2] + src1.element_[0][3] * src2[i0].element_[3];
00007FF69F123F6F vfmadd231ps ymm7,ymm30,ymm1
destination[i0].element_[1] = src1.element_[1][0] * src2[i0].element_[0] + src1.element_[1][1] * src2[i0].element_[1] + src1.element_[1][2] * src2[i0].element_[2] + src1.element_[1][3] * src2[i0].element_[3];
00007FF69F123F75 vfmadd231ps ymm8,ymm30,ymm5
destination[i0].element_[2] = src1.element_[2][0] * src2[i0].element_[0] + src1.element_[2][1] * src2[i0].element_[1] + src1.element_[2][2] * src2[i0].element_[2] + src1.element_[2][3] * src2[i0].element_[3];
00007FF69F123F7B vfmadd231ps ymm31,ymm30,ymm19
destination[i0].element_[3] = src1.element_[3][0] * src2[i0].element_[0] + src1.element_[3][1] * src2[i0].element_[1] + +src1.element_[3][2] * src2[i0].element_[2] + src1.element_[3][3] * src2[i0].element_[3];
00007FF69F123F81 vfmadd231ps ymm6,ymm23,ymm30
destination[i0].element_[0] = src1.element_[0][0] * src2[i0].element_[0] + src1.element_[0][1] * src2[i0].element_[1] + src1.element_[0][2] * src2[i0].element_[2] + src1.element_[0][3] * src2[i0].element_[3];
00007FF69F123F87 vshufps ymm26,ymm26,ymm27,0BEh
00007FF69F123F8E vshufps ymm27,ymm29,ymm28,0EBh
00007FF69F123F95 vshufps ymm28,ymm26,ymm27,6Ch
00007FF69F123F9C vfmadd231ps ymm7,ymm28,ymm2
destination[i0].element_[1] = src1.element_[1][0] * src2[i0].element_[0] + src1.element_[1][1] * src2[i0].element_[1] + src1.element_[1][2] * src2[i0].element_[2] + src1.element_[1][3] * src2[i0].element_[3];
00007FF69F123FA2 vfmadd231ps ymm8,ymm28,ymm16
destination[i0].element_[2] = src1.element_[2][0] * src2[i0].element_[0] + src1.element_[2][1] * src2[i0].element_[1] + src1.element_[2][2] * src2[i0].element_[2] + src1.element_[2][3] * src2[i0].element_[3];
00007FF69F123FA8 vfmadd231ps ymm31,ymm28,ymm20
destination[i0].element_[3] = src1.element_[3][0] * src2[i0].element_[0] + src1.element_[3][1] * src2[i0].element_[1] + +src1.element_[3][2] * src2[i0].element_[2] + src1.element_[3][3] * src2[i0].element_[3];
00007FF69F123FAE vfmadd231ps ymm6,ymm24,ymm28
destination[i0].element_[0] = src1.element_[0][0] * src2[i0].element_[0] + src1.element_[0][1] * src2[i0].element_[1] + src1.element_[0][2] * src2[i0].element_[2] + src1.element_[0][3] * src2[i0].element_[3];
00007FF69F123FB4 vshufps ymm26,ymm26,ymm27,39h
00007FF69F123FBB vfmadd231ps ymm7,ymm26,ymm3
destination[i0].element_[1] = src1.element_[1][0] * src2[i0].element_[0] + src1.element_[1][1] * src2[i0].element_[1] + src1.element_[1][2] * src2[i0].element_[2] + src1.element_[1][3] * src2[i0].element_[3];
00007FF69F123FC1 vfmadd231ps ymm8,ymm26,ymm17
destination[i0].element_[2] = src1.element_[2][0] * src2[i0].element_[0] + src1.element_[2][1] * src2[i0].element_[1] + src1.element_[2][2] * src2[i0].element_[2] + src1.element_[2][3] * src2[i0].element_[3];
00007FF69F123FC7 vfmadd231ps ymm31,ymm26,ymm21
destination[i0].element_[3] = src1.element_[3][0] * src2[i0].element_[0] + src1.element_[3][1] * src2[i0].element_[1] + +src1.element_[3][2] * src2[i0].element_[2] + src1.element_[3][3] * src2[i0].element_[3];
00007FF69F123FCD vfmadd231ps ymm6,ymm25,ymm26
00007FF69F123FD3 vpunpckldq ymm26,ymm7,ymm8
00007FF69F123FD9 vpunpckldq ymm27,ymm31,ymm6
00007FF69F123FDF vpunpckhdq ymm28,ymm7,ymm8
00007FF69F123FE5 vpunpckhdq ymm29,ymm31,ymm6
00007FF69F123FEB vpunpcklqdq ymm30,ymm26,ymm27
00007FF69F123FF1 vpunpckhqdq ymm26,ymm26,ymm27
00007FF69F123FF7 vpunpcklqdq ymm27,ymm28,ymm29
00007FF69F123FFD vpunpckhqdq ymm28,ymm28,ymm29
00007FF69F124003 vinsertf32x4 ymm29,ymm30,xmm26,1
00007FF69F12400A vmovups ymmword ptr [rcx+rax],ymm29
00007FF69F124011 vinsertf32x4 ymm29,ymm27,xmm28,1
00007FF69F124018 vmovups ymmword ptr [rcx+rax+20h],ymm29
00007FF69F124020 vshuff64x2 ymm26,ymm30,ymm26,3
00007FF69F124027 vmovupd ymmword ptr [rcx+rax+40h],ymm26
00007FF69F12402F vshuff64x2 ymm26,ymm27,ymm28,3
00007FF69F124036 vmovupd ymmword ptr [rcx+rax+60h],ymm26
for (uint32_t i0{}; i0 < vector_count; ++i0) {
00007FF69F12403E sub rax,0FFFFFFFFFFFFFF80h
00007FF69F124042 cmp rax,3A9800h
00007FF69F124048 jne mat4_mul_vec4_cpp+1ACh (07FF69F123EFCh)
}
}
00007FF69F12404E vmovaps xmm6,xmmword ptr [rsp]
00007FF69F124053 vmovaps xmm7,xmmword ptr [rsp+10h]
00007FF69F124059 vmovaps xmm8,xmmword ptr [rsp+20h]
00007FF69F12405F add rsp,38h
00007FF69F124063 vzeroupper
00007FF69F124066 ret
This is the calling code for the benchmark when using the same source and destination vector arrays:
int main() {
    SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
    SetThreadPriorityBoost(GetCurrentThread(), false);
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
    mat4f_t matrix{ {0, 1, 2, 3,
                     4, 5, 6, 7,
                     8, 9, 10, 11,
                     12, 13, 14, 15} };
    vec4f_t* dst_vector = new vec4f_t[1'000'000]{};
    vec4f_t* src_vector = new vec4f_t[1'000'000]{};
    vec3f_t* dst_vector0 = new vec3f_t[1'000'000]{};
    vec3f_t* src_vector0 = new vec3f_t[1'000'000]{};
    for (uint32_t i0{}; i0 < 1000000; ++i0) {
        src_vector[i0] = vec4f_t{ (float)i0, (float)i0, (float)i0, (float)i0 };
        src_vector0[i0] = vec3f_t{ (float)i0, (float)i0, (float)i0 };
    }
    set_mxcsr0(0b1111'1111'1100'0000);
    for (uint64_t i0{}; i0 < 30000; ++i0) {
        for (uint32_t i1{}; i1 < 16; ++i1) {
            reinterpret_cast<float*>(matrix.element_)[i1] = 1. / i0;
        }
        mat4_mul_vec4_avx512(src_vector, matrix, src_vector, 240000);
        mat4_mul_vec4_cpp(src_vector, matrix, src_vector, 240000);
        mat4_mul_vec3_avx512(src_vector0, matrix, src_vector0, 240000);
        mat4_mul_vec3_cpp(src_vector0, matrix, src_vector0, 240000);
        fps();
    }
    for (uint32_t i0{}; i0 < 1000000; ++i0) {
        std::cout << src_vector0[i0] << std::endl;
        std::cout << src_vector[i0] << std::endl;
    }
}
When testing with different source and destination arrays, src_vector/src_vector0 as the first argument is replaced by dst_vector/dst_vector0.
These are the benchmark results when using the same source and destination arrays:
[VTune screenshot: same source/destination arrays, Assembly performs much better]
These are the benchmark results when using different source and destination arrays:
[VTune screenshot: different source/destination arrays, Assembly performs nearly the same as C++]
The benchmarks were run on a Windows 11 machine with an 11th Gen Core i7-11850H Tiger Lake CPU, using the Intel C++ Compiler 2024 and, as already mentioned, the Intel VTune Profiler.
There are a few things which I don't understand:
Why does the speed vary when using different/the same source and destination arrays? Does it have anything to do with the cache in this case?
Why does the C++ counterpart get so close to the Assembly variant's performance when using different source and destination arrays?
Thank you for your help. I really appreciate that.
Your first VTune screenshot (the source=dest version) shows the cpp version running 100% scalar ops (far right column), vs. the separate dest case running 100% packed SIMD FP ops.
It's common for auto-vectorization to check for overlap and fall back to a scalar loop if an output overlaps any input, not making a third version that's specialized for dst=src (e.g. perhaps loading everything first, if that still matches the C++ semantics for that case). So it's probably using the scalar fallback because its checks detected overlap. (In code that always has disjoint dst and src, you'd use __restrict to promise that and optimize away those checks. It might happen to compile to asm that works in the perfect-overlap dst=src case, but that's still UB so don't do it; it could break in the future or with different surrounding code to inline into.)
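For example, a minimal sketch of that promise (the function name and the rolled-up inner loop are mine, not from the question; only valid to call with non-overlapping arrays):

void mat4_mul_vec4_cpp_noalias(vec4f_t* __restrict destination, const mat4f_t& src1,
                               const vec4f_t* __restrict src2, uint32_t vector_count) {
    // __restrict promises that destination and src2 never alias, so the compiler
    // can vectorize without emitting a runtime overlap check and scalar fallback.
    for (uint32_t i0{}; i0 < vector_count; ++i0) {
        for (uint32_t row{}; row < 4; ++row) {
            destination[i0].element_[row] =
                src1.element_[row][0] * src2[i0].element_[0] +
                src1.element_[row][1] * src2[i0].element_[1] +
                src1.element_[row][2] * src2[i0].element_[2] +
                src1.element_[row][3] * src2[i0].element_[3];
        }
    }
}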
If your source special-cases the dst=src case, that could get auto-vectorized.
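A hedged sketch of what such a special case could look like, assuming the caller picks this overload when destination == src2 (name and structure are mine):

void mat4_mul_vec4_inplace_cpp(vec4f_t* vectors, const mat4f_t& src1, uint32_t vector_count) {
    // Only one array, so partial overlap between input and output is impossible.
    // Reading the whole vector into a local before any store keeps dst=src
    // well-defined and may let the compiler auto-vectorize without a scalar fallback.
    for (uint32_t i0{}; i0 < vector_count; ++i0) {
        const vec4f_t v = vectors[i0];
        for (uint32_t row{}; row < 4; ++row) {
            vectors[i0].element_[row] =
                src1.element_[row][0] * v.element_[0] +
                src1.element_[row][1] * v.element_[1] +
                src1.element_[row][2] * v.element_[2] +
                src1.element_[row][3] * v.element_[3];
        }
    }
}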
"also according to the static analysis tools uiCA (uica.uops.info) and llvm-mca multiple times faster"
It's not that extreme; uiCA predicts a 1.75x speedup for your asm vs. LLVM's on Tiger Lake.
LLVM's loop: about 14.14 cycles per iteration of 128 bytes (8x float4) using YMM vectors. With perfect scheduling, a back-end port throughput bottleneck of 13.33 cycles, but apparently it predicts too many of the shuffles will get scheduled to port 1 and steal cycles from the MULs/FMAs.
Your loop: about 4.02 cycles per iteration of 64 bytes (4x float4). As expected, if loads/stores aren't a bottleneck, out-of-order exec has no problem overlapping work across iterations and saturating port 0 with FMAs, port 5 with shuffles. The indexed addressing-mode isn't a problem since you don't use it as a memory operand for an ALU instruction, just for pure-load / pure-store.
Yours is also fewer uops for the front-end per float4, so more hyperthreading-friendly if the other logical core has some scalar integer work that can run on ports 1 and 6, or while this thread stalls on load/store bandwidth.
So in the best-case no-stall case, 14.14/2 / 4.02 = 1.756 is the speed ratio in cycles per float4. Maybe you missed the work-per-iteration factor (look at the pointer increments and addressing-mode offsets), or you copy/pasted the whole function instead of just the inner loop? uiCA treats the asm as a loop body whether it ends with a branch or not, assuming any earlier branches are not-taken.
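(Normalizing for work per iteration from those same numbers: the compiler's loop handles 128 bytes per iteration and the asm loop 64 bytes, so 14.14/8 ≈ 1.77 cycles per float4 vs. 4.02/4 ≈ 1.01 cycles per float4, which is where the factor of 2 in the ratio comes from.)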
Memory / cache bandwidth bottlenecks might be the limiting factor, leaving both running at about the same speed? For equivalent code (not like this with tons of extra shuffles with 256-bit vectors), unaligned data can be maybe 15% worse DRAM bandwidth on Skylake Xeon with 512-bit vectors than with 256-bit vectors (where only every other load/store is a cache-line split.)
Source = destination is better for cache footprint since you're only touching half as much data. And better for memory bandwidth: without NT stores, a write costs a read+write if the data isn't already hot in cache, including a Read For Ownership to get the old value of a cache line and exclusive ownership. (I'm not sure if there are any optimizations for an aligned store of a whole 64-byte cache line, like whether that can just invalidate other copies without doing an RFO but without bypassing / evicting from cache like NT stores, instead like how rep movsb / rep stosb microcode works.) Anyway, if DRAM bandwidth was a bottleneck, you'd expect in-place to be 1.5x as fast vs. copy, modulo other factors like mixed read+write having some overhead for the DRAM itself.
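As a rough back-of-the-envelope (my arithmetic, assuming 16-byte vec4 elements and no NT stores): with a separate destination, each vector costs about 16 B of source read + 16 B of RFO read on the destination line + 16 B of writeback = 48 B of DRAM traffic, while in-place it's 16 B read + 16 B writeback = 32 B, and 48/32 = 1.5.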
So the in-place case might be unshackling your asm from bandwidth bottlenecks, as well as causing the C++ to face-plant as it runs scalar code, magnifying the difference.
It looks like your code uses the same matrix every time, but with different vectors. And you load 4 float4 vectors as one ZMM vector.
You're software-pipelining the load+shuffle ahead of the MUL+FMAs, which is really not necessary; all CPUs with AVX-512 have pretty deep out-of-order execution buffers to find instruction-level parallelism across iterations. It's a fairly short dependency chain.
Also, you load + shuffle a final vector but don't FMA + store it. You could peel a final set of MUL/FMA + store out of the end of the last iteration, and have your loop counter do one fewer iteration.
For Zen 4, you might use cmp/jne on the pointer instead of sub/jnz on a separate counter. Zen can macro-fuse cmp/jcc but not sub/jcc.
It looks like ICX unrolled to do 128 bytes of data per iteration, but it is doing a lot of shuffle uops per input vector. One vpermps (or vpermt2ps if it needs to pull data from 2x YMM instead of 1x ZMM with the default -mprefer-vector-width=256) per multiply should be better, but it's not inventing shuffle constants other than immediates for vshufps.
It looks like it broadcasted each scalar element of the matrix to a separate YMM register, with 16x vbroadcastss instructions. So this requires shuffles to line up the results.
Even with "client" chips having 1/clock ZMM FMA/mul/add but 2/clock YMM, I'd have expected your asm to be faster due to both front-end and back-end execution-port throughput bottlenecks from all those shuffles. Even if your data isn't aligned by 64, so every load and store is a cache-line split. (512-bit vectors make aligned data more important than with 256-bit.) Even with possible clock-speed penalties from using 512-bit vectors.
You probably didn't need to write this in asm. You could probably get about the same asm with intrinsics like _mm512_loadu_ps and _mm512_fmadd_ps. asm{} does let you align 32 for that one specific loop to help out the uop cache, but Tiger Lake has a working LSD (loop buffer). (Skylake-X only has its uop cache, no LSD, and it has the JCC erratum creating even more code-alignment potholes for the bottom of loops.)
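For instance, a rough, untested intrinsics sketch (reusing the vpermps index table and types from the question, assuming the same transposed-matrix layout, and, like the asm, requiring vector_count to be a multiple of 4 with no remainder handling):

#include <immintrin.h>
#include <cstdint>

void mat4_mul_vec4_intrin(vec4f_t* destination, const mat4f_t& src1,
                          const vec4f_t* src2, uint32_t vector_count) {
    const float* m = reinterpret_cast<const float*>(&src1);
    // Broadcast each 16-byte row of the transposed matrix to all four 128-bit
    // lanes, matching the vbroadcastf32x4 setup in the asm version.
    const __m512 c0 = _mm512_broadcast_f32x4(_mm_loadu_ps(m + 0));
    const __m512 c1 = _mm512_broadcast_f32x4(_mm_loadu_ps(m + 4));
    const __m512 c2 = _mm512_broadcast_f32x4(_mm_loadu_ps(m + 8));
    const __m512 c3 = _mm512_broadcast_f32x4(_mm_loadu_ps(m + 12));
    // Permute vectors that splat element 0/1/2/3 of each vec4 across its 128-bit lane.
    const __m512i ix = _mm512_load_si512(mat4_mul_vec4_avx512_vpermps_index + 0);
    const __m512i iy = _mm512_load_si512(mat4_mul_vec4_avx512_vpermps_index + 16);
    const __m512i iz = _mm512_load_si512(mat4_mul_vec4_avx512_vpermps_index + 32);
    const __m512i iw = _mm512_load_si512(mat4_mul_vec4_avx512_vpermps_index + 48);
    for (uint32_t i = 0; i < vector_count; i += 4) {   // 4 vec4s per ZMM register
        const __m512 v = _mm512_loadu_ps(reinterpret_cast<const float*>(src2 + i));
        __m512 acc = _mm512_mul_ps(c0, _mm512_permutexvar_ps(ix, v));
        acc = _mm512_fmadd_ps(c1, _mm512_permutexvar_ps(iy, v), acc);
        acc = _mm512_fmadd_ps(c2, _mm512_permutexvar_ps(iz, v), acc);
        acc = _mm512_fmadd_ps(c3, _mm512_permutexvar_ps(iw, v), acc);
        _mm512_storeu_ps(reinterpret_cast<float*>(destination + i), acc);
    }
}

Explicit 512-bit intrinsics shouldn't get narrowed by -mprefer-vector-width=256, so this should keep the ZMM structure of the hand-written loop while letting the compiler handle register allocation and code alignment.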