c++ intel latency avx throughput

Latency and Throughput of _mm256_setr_epi32()


The Intel Intrinsics Guide gives no latency or throughput information for _mm256_setr_epi32().
Does anyone know these figures, or a way to work them out?

Thanks a lot!


Solution

  • Intel itself does not specify it. In the instruction set reference (https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf), the COMPOSITE INTRINSIC section of table C.2 hints that the intrinsic expands to a set of instructions, and which set depends on what the compiler knows about the inputs.

    In the case below, where the compiler cannot predict the inputs, an optimized build produces the following instructions:

        volatile int a = 1, b = 2, c = 3, d = 4, e = 5, f = 6, g = 7, h = 8;
        00007FF71DE91013  mov         dword ptr [rbp+1Ch],1  
        00007FF71DE9101A  mov         dword ptr [rbp+18h],2  
        00007FF71DE91021  mov         dword ptr [rbp+14h],3  
        00007FF71DE91028  mov         dword ptr [rbp+10h],4  
        00007FF71DE9102F  mov         dword ptr [rbp+0Ch],5  
        00007FF71DE91036  mov         dword ptr [rbp+8],6  
        00007FF71DE9103D  mov         dword ptr [rbp+4],7  
        00007FF71DE91044  mov         dword ptr [rbp],8  
    
        volatile __m256i reg = _mm256_setr_epi32(a,b,c,d,e,f,g,h);
        00007FF71DE9104B  mov         ebx,dword ptr [rbp]  
        00007FF71DE9104E  mov         r11d,dword ptr [g]  
        00007FF71DE91052  mov         r10d,dword ptr [f]  
        00007FF71DE91056  mov         r9d,dword ptr [e]  
        00007FF71DE9105A  mov         r8d,dword ptr [d]  
        00007FF71DE9105E  mov         edx,dword ptr [c]  
        00007FF71DE91061  mov         ecx,dword ptr [b]  
        00007FF71DE91064  mov         eax,dword ptr [a]  
        00007FF71DE91067  vmovd       xmm1,eax  
        00007FF71DE9106B  vpinsrd     xmm1,xmm1,ecx,1  
        00007FF71DE91071  vpinsrd     xmm1,xmm1,edx,2  
        00007FF71DE91077  vmovd       xmm0,r9d  
        00007FF71DE9107C  vpinsrd     xmm0,xmm0,r10d,1  
        00007FF71DE91082  vpinsrd     xmm0,xmm0,r11d,2  
        00007FF71DE91088  vpinsrd     xmm1,xmm1,r8d,3  
        00007FF71DE9108E  vpinsrd     xmm0,xmm0,ebx,3  
        00007FF71DE91094  vinsertf128 ymm0,ymm1,xmm0,1  
        00007FF71DE9109A  vmovdqu     ymmword ptr [rbp+20h],ymm0 
    

    But when the inputs are known to the compiler, the result is much shorter:

        volatile __m256i reg = _mm256_setr_epi32(1,2,3,4,5,6,7,8);
        00007FF7789C100F  vmovdqu     ymm0,ymmword ptr [__ymm@0000000800000007000000060000000500000004000000030000000200000001 (07FF7789C2200h)]  
        00007FF7789C1017  vmovdqu     ymmword ptr [rbp],ymm0
    

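    So the latency and cycle count cannot be stated even roughly for the intrinsic itself; they depend on the instruction sequence the compiler emits. In any case, I believe the right section of the instruction reference to look at is the VINSERTI128 description (currently page 1670 in the document linked above); the first listing shows its floating-point counterpart, vinsertf128.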
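
Since the cost depends on the emitted sequence, one way to get a number is to measure it on the target machine. The following is a minimal sketch, not a definitive benchmark. It assumes GCC or Clang on x86-64 (it relies on the `target` function attribute, `__builtin_cpu_supports`, and GNU inline asm), and the helper names `avx_available`, `setr_by_hand`, `setr_intrinsic`, and `ns_per_call` are hypothetical. `setr_by_hand` writes out the vmovd / vpinsrd / vinsertf128 sequence from the first listing with intrinsics, so it can be compared against `_mm256_setr_epi32`. `ns_per_call` times the intrinsic with volatile inputs, as in the first listing, so the figure includes the eight loads and the store, not just the vector construction.

```cpp
// Sketch only. Assumes GCC or Clang on x86-64; all helper names are hypothetical.
#include <immintrin.h>
#include <cassert>
#include <chrono>

// Enable AVX per function so the file builds without -mavx.
#define AVX_FN __attribute__((target("avx")))

// Runtime check, so callers can skip the AVX paths on CPUs without AVX.
bool avx_available() { return __builtin_cpu_supports("avx") != 0; }

// The sequence from the first listing, written out by hand:
// vmovd plus three vpinsrd per 128-bit half, then vinsertf128 to join them.
AVX_FN void setr_by_hand(const int in[8], int out[8]) {
    __m128i lo = _mm_cvtsi32_si128(in[0]);   // vmovd
    lo = _mm_insert_epi32(lo, in[1], 1);     // vpinsrd
    lo = _mm_insert_epi32(lo, in[2], 2);
    lo = _mm_insert_epi32(lo, in[3], 3);
    __m128i hi = _mm_cvtsi32_si128(in[4]);   // vmovd
    hi = _mm_insert_epi32(hi, in[5], 1);     // vpinsrd
    hi = _mm_insert_epi32(hi, in[6], 2);
    hi = _mm_insert_epi32(hi, in[7], 3);
    // vinsertf128: place hi in the upper 128 bits above lo.
    __m256i v = _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), v);
}

// The same result, leaving the choice of instructions to the compiler.
AVX_FN void setr_intrinsic(const int in[8], int out[8]) {
    __m256i v = _mm256_setr_epi32(in[0], in[1], in[2], in[3],
                                  in[4], in[5], in[6], in[7]);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), v);
}

// Average wall-clock nanoseconds per iteration of: 8 volatile loads,
// _mm256_setr_epi32, one unaligned store. Not an isolated instruction cost.
AVX_FN double ns_per_call(int iterations) {
    assert(iterations > 0);
    volatile int src[8] = {1, 2, 3, 4, 5, 6, 7, 8};  // inputs the compiler cannot fold
    int sink[8];
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        __m256i v = _mm256_setr_epi32(src[0], src[1], src[2], src[3],
                                      src[4], src[5], src[6], src[7]);
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(sink), v);
        // Compiler barrier: treat sink as read, so the store is not dropped.
        asm volatile("" : : "r"(sink) : "memory");
    }
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

Calling `ns_per_call` with a large iteration count (after checking `avx_available()`) gives a rough per-iteration time for this particular build and machine. Inspecting the generated assembly remains necessary, because a different compiler or optimization level may emit a different sequence than the one shown in the listing.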