c++ gcc inline-assembly sse2

SSE2 instructions not working in inline assembly with C++


I have this function, which uses SSE2 to add some values together. It's supposed to add lhs and rhs together and store the result back into lhs:

template<typename T>
void simdAdd(T *lhs,T *rhs)
{
    asm volatile("movups %0,%%xmm0"::"m"(lhs));
    asm volatile("movups %0,%%xmm1"::"m"(rhs));

    switch(sizeof(T))
    {
        case sizeof(uint8_t):
        asm volatile("paddb %%xmm0,%%xmm1":);
        break;

        case sizeof(uint16_t):
        asm volatile("paddw %%xmm0,%%xmm1":);
        break;

        case sizeof(float):
        asm volatile("addps %%xmm0,%%xmm1":);
        break;

        case sizeof(double):
        asm volatile("addpd %%xmm0,%%xmm1":);
        break;

        default:
        std::cout<<"error"<<std::endl;
        break;
    }

    asm volatile("movups %%xmm0,%0":"=m"(lhs));
}

My code uses the function like this:

float *values=new float[4];
float *values2=new float[4];

values[0]=1.0f;
values[1]=2.0f;
values[2]=3.0f;
values[3]=4.0f;

values2[0]=1.0f;
values2[1]=2.0f;
values2[2]=3.0f;
values2[3]=4.0f;

simdAdd(values,values2);
for(uint32_t count=0;count<4;count++) std::cout<<values[count]<<std::endl;

However, this isn't working: when the code runs, it outputs 1, 2, 3, 4 instead of 2, 4, 6, 8.


Solution

  • I've found that inline assembly support isn't reliable in most modern compilers (as in, the implementations are just plain buggy). You are generally better off using compiler intrinsics, which are declarations that look like C functions but compile to a specific opcode.

    Intrinsics let you specify exactly which opcodes to use but leave register allocation to the compiler. That is much more reliable than trying to move data between C variables and asm registers by hand, which is where inline assemblers have always fallen down for me. It also lets the compiler schedule your instructions, which can improve performance if it works around pipeline hazards. For example, in this case you could do:

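    #include <xmmintrin.h>  // SSE intrinsics: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps

    void simdAdd(float *lhs,float *rhs)
    {
       // Load four floats from each array, add them lane by lane, store the sums into lhs.
       _mm_storeu_ps( lhs, _mm_add_ps(_mm_loadu_ps( lhs ), _mm_loadu_ps( rhs )) );
    }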
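
The question's template also covers uint8_t, uint16_t and double. One way to get the same coverage with intrinsics is a set of overloads, one per element type. This is only a sketch under a few assumptions: an x86 target with SSE2 enabled, unaligned loads and stores, and arrays that hold at least one full 128-bit vector. The const qualifiers on rhs are an addition that is not in the original signature.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics; also provides the SSE ones from xmmintrin.h

// Each overload adds exactly one 128-bit vector: 16 x uint8_t, 8 x uint16_t,
// 4 x float or 2 x double. The caller must supply at least that many elements.

inline void simdAdd(std::uint8_t *lhs, const std::uint8_t *rhs)
{
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(lhs));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(rhs));
    _mm_storeu_si128(reinterpret_cast<__m128i *>(lhs), _mm_add_epi8(a, b));   // paddb
}

inline void simdAdd(std::uint16_t *lhs, const std::uint16_t *rhs)
{
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(lhs));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(rhs));
    _mm_storeu_si128(reinterpret_cast<__m128i *>(lhs), _mm_add_epi16(a, b));  // paddw
}

inline void simdAdd(float *lhs, const float *rhs)
{
    _mm_storeu_ps(lhs, _mm_add_ps(_mm_loadu_ps(lhs), _mm_loadu_ps(rhs)));     // addps
}

inline void simdAdd(double *lhs, const double *rhs)
{
    _mm_storeu_pd(lhs, _mm_add_pd(_mm_loadu_pd(lhs), _mm_loadu_pd(rhs)));     // addpd
}
```

With the arrays from the question, simdAdd(values,values2) selects the float overload and leaves 2, 4, 6, 8 in values. Overloading on the element type, rather than switching on sizeof(T), also means a type that merely has the same size as float (a 32-bit integer, say) fails to compile instead of silently going through addps.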
    

    In your case, anyway, you have two problems:

    1. The terrible GCC inline assembly syntax, which makes it easy to confuse pointers with the values they point to. Use *lhs and *rhs instead of just lhs and rhs: an "m" or "=m" operand refers to the memory where the given expression is stored, so passing lhs makes movups read and write the pointer variable itself rather than the floats it points to.
    2. GCC uses AT&T operand order (source, destination), so addps stores its result in the second operand. You need to write out xmm1, not xmm0.

    I've put a fixed example on codepad (to avoid cluttering up this answer, and to demonstrate that it works).
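
The two fixes in the list above can be sketched as follows for the float case. This is not the codepad code, only a minimal sketch assuming GCC or Clang on x86 with SSE enabled. The name simdAddAsm, the single combined asm statement, and the clobber list are choices made for this sketch, not part of the original.

```cpp
// Assumes GCC or Clang extended asm on x86/x86-64 with SSE enabled.
// simdAddAsm is a hypothetical name; it adds four floats from rhs into lhs.
inline void simdAddAsm(float *lhs, const float *rhs)
{
    asm volatile(
        "movups %0, %%xmm0\n\t"      // xmm0 = lhs[0..3]
        "movups %1, %%xmm1\n\t"      // xmm1 = rhs[0..3]
        "addps  %%xmm0, %%xmm1\n\t"  // AT&T order: xmm1 = xmm1 + xmm0
        "movups %%xmm1, %0"          // store the destination register (xmm1) to lhs
        : "+m"(*lhs)                 // fix 1: the pointed-to data, not the pointer
        : "m"(*rhs)
        : "xmm0", "xmm1", "memory"); // registers we overwrite; "memory" covers the
                                     // 12 bytes beyond the first float of each operand
}
```

Keeping everything in one asm statement is a deliberate choice here: the compiler does not know that xmm0 and xmm1 hold live values between separate asm statements, so it is free to use those registers for its own code in between. The clobber list tells it which registers this statement overwrites.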