I have this function, which uses SSE2 to add some values together. It's supposed to add lhs and rhs together and store the result back into lhs:
template<typename T>
void simdAdd(T *lhs,T *rhs)
{
    asm volatile("movups %0,%%xmm0"::"m"(lhs));
    asm volatile("movups %0,%%xmm1"::"m"(rhs));
    switch(sizeof(T))
    {
        case sizeof(uint8_t):
            asm volatile("paddb %%xmm0,%%xmm1":);
            break;
        case sizeof(uint16_t):
            asm volatile("paddw %%xmm0,%%xmm1":);
            break;
        case sizeof(float):
            asm volatile("addps %%xmm0,%%xmm1":);
            break;
        case sizeof(double):
            asm volatile("addpd %%xmm0,%%xmm1":);
            break;
        default:
            std::cout<<"error"<<std::endl;
            break;
    }
    asm volatile("movups %%xmm0,%0":"=m"(lhs));
}
My code uses the function like this:
float *values=new float[4];
float *values2=new float[4];
values[0]=1.0f;
values[1]=2.0f;
values[2]=3.0f;
values[3]=4.0f;
values2[0]=1.0f;
values2[1]=2.0f;
values2[2]=3.0f;
values2[3]=4.0f;
simdAdd(values,values2);
for(uint32_t count=0;count<4;count++) std::cout<<values[count]<<std::endl;
However, this isn't working: when the code runs, it outputs 1, 2, 3, 4 instead of 2, 4, 6, 8.
I've found that inline assembly support isn't reliable in most modern compilers (as in, the implementations are just plain buggy). You are generally better off using compiler intrinsics, which are declarations that look like C functions but actually compile to a specific opcode.
Intrinsics let you specify an exact sequence of opcodes, but leave register allocation to the compiler. That's much more reliable than trying to move data between C variables and asm registers, which is where inline assemblers have always fallen down for me. It also lets the compiler schedule your instructions, which can give better performance if it works around pipeline hazards. For example, in this case you could do:
void simdAdd(float *lhs,float *rhs)
{
    _mm_storeu_ps( lhs, _mm_add_ps(_mm_loadu_ps( lhs ), _mm_loadu_ps( rhs )) );
}
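The original template also handled uint8_t, uint16_t and double. As a rough sketch (my own extension, not part of the example above), the same idea could be spread over one overload per element type, each using the intrinsic that corresponds to the opcode in the question's switch. Overloading on the pointer type instead of switching on sizeof(T) is an assumption on my part; it has the side effect that unsupported types fail at compile time rather than printing "error" at run time. Each overload handles exactly one 128-bit vector.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics; also provides the SSE float ones

// 16 x uint8_t per call (paddb)
inline void simdAdd(std::uint8_t *lhs, const std::uint8_t *rhs)
{
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(lhs));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(rhs));
    _mm_storeu_si128(reinterpret_cast<__m128i *>(lhs), _mm_add_epi8(a, b));
}

// 8 x uint16_t per call (paddw)
inline void simdAdd(std::uint16_t *lhs, const std::uint16_t *rhs)
{
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(lhs));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(rhs));
    _mm_storeu_si128(reinterpret_cast<__m128i *>(lhs), _mm_add_epi16(a, b));
}

// 4 x float per call (addps)
inline void simdAdd(float *lhs, const float *rhs)
{
    _mm_storeu_ps(lhs, _mm_add_ps(_mm_loadu_ps(lhs), _mm_loadu_ps(rhs)));
}

// 2 x double per call (addpd)
inline void simdAdd(double *lhs, const double *rhs)
{
    _mm_storeu_pd(lhs, _mm_add_pd(_mm_loadu_pd(lhs), _mm_loadu_pd(rhs)));
}
```

Note that the integer additions wrap around on overflow, just as paddb and paddw do, and that the caller must make sure both buffers hold at least 16 bytes.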
In your case, anyway, you've got two problems:
1. You need to pass *lhs and *rhs instead of just lhs and rhs; apparently the "=m" syntax means "implicitly use a pointer to this thing that I'm passing you instead of the thing itself."
2. In AT&T syntax the destination is the second operand, so the result of the addition ends up in xmm1, not xmm0.
I've put a fixed example on codepad (to avoid cluttering up this answer, and to demonstrate that it works).
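For readers who can't reach the codepad paste, here is a minimal sketch of what an inline-asm version with both fixes applied might look like. This is not the codepad code; it assumes GCC or Clang extended asm on x86 or x86-64, covers only the float case, and the name simdAddAsm is made up for illustration. Beyond the two fixes, it also folds everything into a single asm statement and lists xmm0 and xmm1 as clobbers, since the compiler makes no promise to leave those registers alone between separate asm statements.

```cpp
// Adds four floats from rhs into lhs using SSE (GCC/Clang extended asm, AT&T syntax).
inline void simdAddAsm(float *lhs, const float *rhs)
{
    asm volatile(
        "movups %0, %%xmm0\n\t"       // xmm0 = lhs[0..3]
        "movups %1, %%xmm1\n\t"       // xmm1 = rhs[0..3]
        "addps %%xmm0, %%xmm1\n\t"    // AT&T order: xmm1 = xmm1 + xmm0
        "movups %%xmm1, %0"           // store the sum, which lives in xmm1
        : "+m"(*lhs)                  // dereferenced, so the operand is the data itself
        : "m"(*rhs)
        : "xmm0", "xmm1", "memory");  // "memory": the asm touches 16 bytes, not just one float
}
```

Calling simdAddAsm(values,values2) on the arrays from the question should then leave 2, 4, 6, 8 in values.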