The Intel Intrinsics Guide gives no latency or throughput information for _mm256_setr_epi32().
Does anyone know these figures, or a way to calculate them?
Thanks a lot!
Intel itself does not specify it. In the instruction set reference at https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf the COMPOSITE INTRINSIC section of table C.2 hints that the intrinsic maps to a set of instructions, and which instructions depends on the input the compiler sees.
In the case below, where the compiler cannot predict the inputs, an optimized build produces the following instructions:
volatile int a = 1, b = 2, c = 3, d = 4, e = 5, f = 6, g = 7, h = 8;
00007FF71DE91013 mov dword ptr [rbp+1Ch],1
00007FF71DE9101A mov dword ptr [rbp+18h],2
00007FF71DE91021 mov dword ptr [rbp+14h],3
00007FF71DE91028 mov dword ptr [rbp+10h],4
00007FF71DE9102F mov dword ptr [rbp+0Ch],5
00007FF71DE91036 mov dword ptr [rbp+8],6
00007FF71DE9103D mov dword ptr [rbp+4],7
00007FF71DE91044 mov dword ptr [rbp],8
volatile __m256i reg = _mm256_setr_epi32(a,b,c,d,e,f,g,h);
00007FF71DE9104B mov ebx,dword ptr [rbp]
00007FF71DE9104E mov r11d,dword ptr [g]
00007FF71DE91052 mov r10d,dword ptr [f]
00007FF71DE91056 mov r9d,dword ptr [e]
00007FF71DE9105A mov r8d,dword ptr [d]
00007FF71DE9105E mov edx,dword ptr [c]
00007FF71DE91061 mov ecx,dword ptr [b]
00007FF71DE91064 mov eax,dword ptr [a]
00007FF71DE91067 vmovd xmm1,eax
00007FF71DE9106B vpinsrd xmm1,xmm1,ecx,1
00007FF71DE91071 vpinsrd xmm1,xmm1,edx,2
00007FF71DE91077 vmovd xmm0,r9d
00007FF71DE9107C vpinsrd xmm0,xmm0,r10d,1
00007FF71DE91082 vpinsrd xmm0,xmm0,r11d,2
00007FF71DE91088 vpinsrd xmm1,xmm1,r8d,3
00007FF71DE9108E vpinsrd xmm0,xmm0,ebx,3
00007FF71DE91094 vinsertf128 ymm0,ymm1,xmm0,1
00007FF71DE9109A vmovdqu ymmword ptr [rbp+20h],ymm0
But when the inputs are known to the compiler, the result is much shorter:
volatile __m256i reg = _mm256_setr_epi32(1,2,3,4,5,6,7,8);
00007FF7789C100F vmovdqu ymm0,ymmword ptr [__ymm@0000000800000007000000060000000500000004000000030000000200000001 (07FF7789C2200h)]
00007FF7789C1017 vmovdqu ymmword ptr [rbp],ymm0
So there is no single latency or throughput figure, even a rough one, because it depends on what the compiler generates. In any case, I believe the right section to look at in the instruction reference is the VINSERTI128 description (currently page 1670 if you follow my link above).
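As for a way to calculate it: one rough approach is to add up per-instruction latencies along the longest dependency chain of whatever sequence your compiler emitted. The sketch below models only the nine vector instructions from the first listing (two independent vmovd + 3x vpinsrd chains joined by vinsertf128). The names Insn, critical_path and setr_epi32_sequence are made up for illustration, and the latencies are parameters on purpose: you have to look up real values for your microarchitecture in a per-instruction table. It ignores the loads, the final store, throughput and port contention, so treat the result as an estimate at best.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// One instruction in the dependency graph: an assumed latency in cycles and
// the indices of the earlier instructions whose results it consumes.
// (Hypothetical helper type, not part of any Intel or compiler API.)
struct Insn {
    std::string name;
    int latency;
    std::vector<int> deps;
};

// Length of the longest dependency chain. Instructions must be ordered so
// that every dependency index is smaller than the index of its consumer.
int critical_path(const std::vector<Insn>& prog) {
    std::vector<int> done(prog.size(), 0);  // cycle at which each result is ready
    int longest = 0;
    for (std::size_t i = 0; i < prog.size(); ++i) {
        int start = 0;
        for (int d : prog[i].deps) start = std::max(start, done[d]);
        done[i] = start + prog[i].latency;
        longest = std::max(longest, done[i]);
    }
    return longest;
}

// The vector part of the first listing, in program order. The three latency
// arguments are assumptions supplied by the caller, not measured values.
std::vector<Insn> setr_epi32_sequence(int lat_vmovd, int lat_vpinsrd,
                                      int lat_vinsertf128) {
    return {
        {"vmovd xmm1,eax",                lat_vmovd,       {}},      // 0
        {"vpinsrd xmm1,xmm1,ecx,1",       lat_vpinsrd,     {0}},     // 1
        {"vpinsrd xmm1,xmm1,edx,2",       lat_vpinsrd,     {1}},     // 2
        {"vmovd xmm0,r9d",                lat_vmovd,       {}},      // 3
        {"vpinsrd xmm0,xmm0,r10d,1",      lat_vpinsrd,     {3}},     // 4
        {"vpinsrd xmm0,xmm0,r11d,2",      lat_vpinsrd,     {4}},     // 5
        {"vpinsrd xmm1,xmm1,r8d,3",       lat_vpinsrd,     {2}},     // 6
        {"vpinsrd xmm0,xmm0,ebx,3",       lat_vpinsrd,     {5}},     // 7
        {"vinsertf128 ymm0,ymm1,xmm0,1",  lat_vinsertf128, {6, 7}},  // 8
    };
}
```

With placeholder latencies of 1, 2 and 3 cycles (not real figures), critical_path(setr_epi32_sequence(1, 2, 3)) gives 1 + 3*2 + 3 = 10, since both xmm chains have the same length and vinsertf128 waits for both. For the constant-input case the whole thing collapses to a single load, so this model does not apply there.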