assembly x86 simd sse dot-product

Dot product performance with SSE instructions: is DPPS worth using?


Is it faster to calculate the dot product of two short (4-element) vectors with the SSE4.1 dpps instruction, or with a series of mulps / shufps / addps instructions from SSE1?

(For big vectors, of course, it's best to mulps/addps into vector accumulators and do one horizontal sum at the end, with enough sum vectors to hide FP latency.)
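
The big-vector approach mentioned in the parenthetical can be sketched as follows. This is a minimal illustration using only SSE1 intrinsics; the function name DotLarge and the choice of four accumulators are assumptions made for the example, and the right number of accumulators depends on the add latency and throughput of the target CPU.

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics
#include <cstddef>

// Dot product of two float arrays of length n (hypothetical helper).
// Four independent accumulators let consecutive addps instructions overlap
// instead of waiting on each other; one horizontal sum is done at the end.
inline float DotLarge(const float* a, const float* b, std::size_t n)
{
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps();
    __m128 acc3 = _mm_setzero_ps();

    std::size_t i = 0;
    for (; i + 16 <= n; i += 16)
    {
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(_mm_loadu_ps(a + i),      _mm_loadu_ps(b + i)));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(_mm_loadu_ps(a + i + 4),  _mm_loadu_ps(b + i + 4)));
        acc2 = _mm_add_ps(acc2, _mm_mul_ps(_mm_loadu_ps(a + i + 8),  _mm_loadu_ps(b + i + 8)));
        acc3 = _mm_add_ps(acc3, _mm_mul_ps(_mm_loadu_ps(a + i + 12), _mm_loadu_ps(b + i + 12)));
    }

    // Combine the accumulators vertically (still no horizontal work).
    __m128 sum = _mm_add_ps(_mm_add_ps(acc0, acc1), _mm_add_ps(acc2, acc3));

    // Single horizontal sum: fold the high half onto the low half,
    // then add lane 1 into lane 0.
    sum = _mm_add_ps(sum, _mm_movehl_ps(sum, sum));
    sum = _mm_add_ss(sum, _mm_shuffle_ps(sum, sum, _MM_SHUFFLE(1, 1, 1, 1)));
    float result = _mm_cvtss_f32(sum);

    // Scalar tail for the remaining n % 16 elements.
    for (; i < n; ++i)
        result += a[i] * b[i];
    return result;
}
```

Note that the summation order differs from a plain scalar loop, so results can differ in the last bits for general floating-point input.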


Solution

  • The answer is likely to be very contextual: it depends on exactly where and how the instruction is used in the larger code flow, as well as on exactly what hardware you are using.

    Historically, when Intel has introduced new instructions, they haven't dedicated much hardware area to them. If an instruction gets adopted and used enough, they put more hardware behind it in future generations. So _mm_dp_ps on Penryn wasn't particularly impressive compared to the SSE/SSE2 shuffle-and-add way in terms of raw ALU performance. On the other hand, it needs fewer instructions and so takes less space in the I-cache, which could help in cases where more compact code performs better.

    The real problem with _mm_dp_ps is that, as part of SSE4.1, you can't count on it being supported on every PC, even a modern one (Valve's Steam Hardware Survey pegs it at about 85% for gamers). Therefore, you end up having to write guarded code paths rather than straight-line code, and that usually costs more than the benefit you get from using the instruction.
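
    A guarded code path of the kind described above might look like the following sketch. It assumes GCC, Clang, or MSVC on x86; the names HasSSE41, Dot4_SSE41, Dot4_SSE, and Dot4 are hypothetical and are not part of DirectXMath. The indirect call through a function pointer is the sort of overhead that can outweigh the saving from dpps on a 4-element dot product.

```cpp
#include <xmmintrin.h>  // SSE1
#include <smmintrin.h>  // SSE4.1 (_mm_dp_ps)
#if defined(_MSC_VER)
#include <intrin.h>     // __cpuid
#endif

// GCC and Clang need a per-function target attribute to use SSE4.1
// intrinsics when the file is not compiled with -msse4.1; MSVC does not.
#if defined(__GNUC__) || defined(__clang__)
#define DOT_TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define DOT_TARGET_SSE41
#endif

// True if the running CPU reports SSE4.1 (CPUID leaf 1, ECX bit 19).
inline bool HasSSE41()
{
#if defined(_MSC_VER)
    int regs[4] = {};
    __cpuid(regs, 1);
    return (regs[2] & (1 << 19)) != 0;
#else
    return __builtin_cpu_supports("sse4.1") != 0;
#endif
}

// SSE4.1 path: mask 0xF1 multiplies all four lanes and writes the sum to lane 0.
DOT_TARGET_SSE41 inline float Dot4_SSE41(__m128 a, __m128 b)
{
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0xF1));
}

// Fallback path: multiply, then two shuffle/add steps.
inline float Dot4_SSE(__m128 a, __m128 b)
{
    __m128 m = _mm_mul_ps(a, b);
    m = _mm_add_ps(m, _mm_movehl_ps(m, m));
    m = _mm_add_ss(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 1, 1, 1)));
    return _mm_cvtss_f32(m);
}

using Dot4Fn = float (*)(__m128, __m128);

// The CPU check runs once; every later call still pays for an indirect call.
inline float Dot4(__m128 a, __m128 b)
{
    static const Dot4Fn fn = HasSSE41() ? Dot4_SSE41 : Dot4_SSE;
    return fn(a, b);
}
```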

    Update February 2024: SSE2/SSE3 are at 100% on Steam. SSSE3, SSE4.1, and SSE4.2 are at 99%. Even AVX and AVX2 are in the upper 90s. These days it would be reasonable to build your game with /arch:AVX, check at start-up that SSE/SSE2/SSE3/SSSE3/SSE4.1/SSE4.2/AVX are all supported, and just not support machines that lack any of those.
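
    A start-up check along those lines could be sketched as below. This is an illustration, not code from DirectXMath: the CpuFeatures struct and the function names are hypothetical, it assumes MSVC or GCC/Clang on x86, and it relies on SSE/SSE2 being part of the x64 baseline. The bit positions are those of CPUID leaf 1, ECX. AVX additionally needs the operating system to have enabled YMM state saving, which is what the XGETBV test covers.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>     // __cpuid
#include <immintrin.h>  // _xgetbv
#else
#include <cpuid.h>      // __get_cpuid (GCC/Clang)
#endif

struct CpuFeatures  // hypothetical name
{
    bool sse3, ssse3, sse41, sse42, avx;
};

inline CpuFeatures QueryCpuFeatures()
{
    unsigned ecx = 0;
#if defined(_MSC_VER)
    int regs[4] = {};
    __cpuid(regs, 1);
    ecx = static_cast<unsigned>(regs[2]);
#else
    unsigned eax = 0, ebx = 0, edx = 0;
    (void)__get_cpuid(1, &eax, &ebx, &ecx, &edx);
#endif

    CpuFeatures f = {};
    f.sse3  = (ecx & (1u << 0))  != 0;
    f.ssse3 = (ecx & (1u << 9))  != 0;
    f.sse41 = (ecx & (1u << 19)) != 0;
    f.sse42 = (ecx & (1u << 20)) != 0;

    // AVX: CPU bit 28, plus OSXSAVE (bit 27) and XCR0 bits 1 and 2,
    // which show that the OS saves and restores XMM and YMM state.
    const bool osxsave = (ecx & (1u << 27)) != 0;
    const bool avxBit  = (ecx & (1u << 28)) != 0;
    if (osxsave && avxBit)
    {
#if defined(_MSC_VER)
        const std::uint64_t xcr0 = _xgetbv(0);
#else
        unsigned lo = 0, hi = 0;
        __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0u));
        const std::uint64_t xcr0 = (static_cast<std::uint64_t>(hi) << 32) | lo;
#endif
        f.avx = (xcr0 & 0x6) == 0x6;
    }
    return f;
}

// True if the machine meets the SSE3..AVX baseline described in the text.
inline bool MeetsAvxBaseline()
{
    const CpuFeatures f = QueryCpuFeatures();
    return f.sse3 && f.ssse3 && f.sse41 && f.sse42 && f.avx;
}
```

    The check uses only integer code, so it can reasonably run before any AVX-dependent code path is reached.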

    It is useful when you are building a binary for a CPU that is guaranteed to support it. For example, if you are building with /arch:AVX (or even /arch:AVX2), either because you are targeting a fixed platform like the Xbox One or because you are building multiple versions of your EXE/DLL, you can assume SSE4.1 is supported as well.

    This is effectively what DirectXMath does:

    inline XMVECTOR XMVector4Dot( FXMVECTOR V1, FXMVECTOR V2 )
    {
    #if defined(_XM_NO_INTRINSICS_)
    
        XMVECTOR Result;
        Result.vector4_f32[0] =
        Result.vector4_f32[1] =
        Result.vector4_f32[2] =
        Result.vector4_f32[3] = V1.vector4_f32[0] * V2.vector4_f32[0] + V1.vector4_f32[1] * V2.vector4_f32[1] + V1.vector4_f32[2] * V2.vector4_f32[2] + V1.vector4_f32[3] * V2.vector4_f32[3];
        return Result;
    
    #elif defined(_M_ARM) || defined(_M_ARM64)
    
        float32x4_t vTemp = vmulq_f32( V1, V2 );
        float32x2_t v1 = vget_low_f32( vTemp );
        float32x2_t v2 = vget_high_f32( vTemp );
        v1 = vpadd_f32( v1, v1 );
        v2 = vpadd_f32( v2, v2 );
        v1 = vadd_f32( v1, v2 );
        return vcombine_f32( v1, v1 );
    
    #elif defined(__AVX__) || defined(__AVX2__)
    
        return _mm_dp_ps( V1, V2, 0xff );
    
    #elif defined(_M_IX86) || defined(_M_X64)
    
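        XMVECTOR vTemp2 = V2;
        XMVECTOR vTemp = _mm_mul_ps(V1,vTemp2);                      // Per-lane products: x, y, z, w
        vTemp2 = _mm_shuffle_ps(vTemp2,vTemp,_MM_SHUFFLE(1,0,0,0));  // Copy x to the z lane and y to the w lane
        vTemp2 = _mm_add_ps(vTemp2,vTemp);                           // z lane = x+z, w lane = y+w
        vTemp = _mm_shuffle_ps(vTemp,vTemp2,_MM_SHUFFLE(0,3,0,0));   // Copy y+w into the z lane
        vTemp = _mm_add_ps(vTemp,vTemp2);                            // z lane = (y+w) + (x+z)
        return _mm_shuffle_ps(vTemp,vTemp,_MM_SHUFFLE(2,2,2,2));     // Splat the z lane across the result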
    
    #else
        #error Unsupported platform
    #endif
    }
    

    This of course assumes you are going to use the 'scalar' result of a dot-product in additional vector operations. By convention, DirectXMath returns such scalars 'splatted' across the return vector.
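
    To illustrate why a splatted result is convenient, here is a minimal sketch that feeds the dot product straight into further vector math to normalize a 4-vector. It uses plain SSE1 intrinsics and a different shuffle sequence from the DirectXMath code above; Dot4Splat and Normalize4 are hypothetical names.

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics

// Dot product with the result replicated in all four lanes.
inline __m128 Dot4Splat(__m128 a, __m128 b)
{
    __m128 m = _mm_mul_ps(a, b);                                              // (x, y, z, w)
    __m128 s = _mm_add_ps(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 3, 0, 1)));  // (x+y, x+y, z+w, z+w)
    return _mm_add_ps(s, _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 0, 3, 2)));      // x+y+z+w in every lane
}

// Scale v to unit length. Because the squared length is already in every
// lane, no scalar extraction or re-broadcast is needed before the divide.
// No guard against zero-length input is included in this sketch.
inline __m128 Normalize4(__m128 v)
{
    __m128 lenSq = Dot4Splat(v, v);
    return _mm_div_ps(v, _mm_sqrt_ps(lenSq));
}
```

    If the caller only wants a single float, the splat is wasted work, and extracting lane 0 with _mm_cvtss_f32 after a cheaper reduction is usually enough.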

    See DirectXMath: SSE4.1 and SSE4.2

    UPDATE: SSE3 support is not quite as ubiquitous as SSE/SSE2 support, but you could require it for cases where you aren't building with /arch:AVX or /arch:AVX2 and try:

    inline XMVECTOR XMVector4Dot(FXMVECTOR V1, FXMVECTOR V2)
    {
        XMVECTOR vTemp = _mm_mul_ps(V1,V2);      // Per-lane products: x, y, z, w
        vTemp = _mm_hadd_ps( vTemp, vTemp );     // (x+y, z+w, x+y, z+w)
        return _mm_hadd_ps( vTemp, vTemp );      // x+y+z+w in every lane
    }
    

    That said, it's not clear that hadd is much of a win over the SSE/SSE2 add and shuffle solution in most cases, at least for a dot product.