I have the following piece of C code:
__m128 pSrc1 = _mm_set1_ps(4.0f);
__m128 m1, m2, m3, pDest;
int i;
for (i = 0; i < 100; i++) {
    m1 = _mm_mul_ps(pSrc1, pSrc1);
    m2 = _mm_mul_ps(pSrc1, pSrc1);
    m3 = _mm_add_ps(m1, m2);
    pDest = _mm_add_ps(m3, m3);
}
float *arrq = (float*) pDest;
Everything up to the end of the for loop works. What I am trying to do now is convert the __m128 value back to floats. Since it stores 4 floats, I thought I could simply cast it to float*. What am I doing wrong? (This is just test code, so don't be surprised that it does nothing useful.) I have tried every conversion I could think of. Thanks for your help.
You can use _mm_store_ps to store a __m128 vector into a float array.
alignas(16) float result[4];  // alignas: C11 with <stdalign.h>, or C++11
_mm_store_ps (result, pDest);
// If result is not 16-byte aligned, use _mm_storeu_ps
// On modern CPUs this is just as fast as _mm_store_ps if
// result is 16-byte aligned, but works in all other cases as well
_mm_storeu_ps (result, pDest);
You can then access any or all elements from that temporary array, and if you're lucky the compiler will turn this into a shuffle instead of a store/reload if that's more efficient. (If the destination isn't just a temporary and you actually want all 4 elements stored somewhere, then _mm_storeu_ps or _mm_store_ps is exactly what you want.)
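As a minimal sketch of the store approach, the snippet below repeats the question's arithmetic once and writes the four lanes to a caller-supplied array. The helper name compute_and_store is made up for illustration, and _mm_storeu_ps is used so the array does not have to be 16-byte aligned.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_set1_ps, _mm_mul_ps, _mm_add_ps, _mm_storeu_ps */

/* Hypothetical helper (not from the question): runs the question's
   arithmetic once and writes the four resulting lanes to out[0..3]. */
static void compute_and_store(float out[4])
{
    __m128 pSrc1 = _mm_set1_ps(4.0f);       /* 4 in every lane */
    __m128 m1 = _mm_mul_ps(pSrc1, pSrc1);   /* 16 in every lane */
    __m128 m2 = _mm_mul_ps(pSrc1, pSrc1);   /* 16 in every lane */
    __m128 m3 = _mm_add_ps(m1, m2);         /* 32 in every lane */
    __m128 pDest = _mm_add_ps(m3, m3);      /* 64 in every lane */

    /* Unaligned store: valid for any float pointer, aligned or not. */
    _mm_storeu_ps(out, pDest);
}
```

After calling compute_and_store(result) on a float result[4], each of result[0] through result[3] can be read like any ordinary float.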
If you want just the low element, float _mm_cvtss_f32(__m128) is good.
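A small sketch of reading single elements without going through memory: the low lane comes straight from _mm_cvtss_f32, and another lane can be moved into the low position with _mm_shuffle_ps first. The function names low_element and element2 are hypothetical, chosen only for this example.

```c
#include <xmmintrin.h>  /* SSE: _mm_cvtss_f32, _mm_shuffle_ps, _MM_SHUFFLE */

/* Hypothetical helper: returns lane 0 (the low element) of v. */
static float low_element(__m128 v)
{
    return _mm_cvtss_f32(v);
}

/* Hypothetical helper: returns lane 2 of v by shuffling it into lane 0.
   The last argument of _MM_SHUFFLE selects the source lane for result lane 0;
   the shuffle control must be a compile-time constant. */
static float element2(__m128 v)
{
    __m128 moved = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 2));
    return _mm_cvtss_f32(moved);
}
```

Note that _mm_setr_ps(a, b, c, d) puts a in lane 0, whereas _mm_set_ps takes its arguments in the opposite order, so the "low element" is the last argument of _mm_set_ps.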
If you want to combine the vector elements down to a single float after a loop that sums an array or does a dot-product, see Fastest way to do horizontal SSE vector sum (or other reduction)
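For illustration only, here is one common SSE1-only way to add the four lanes into a single float. It is a sketch of the general shuffle-and-add idea and not necessarily the variant the linked answer recommends for a given CPU; the name hsum_ps_sse1 is an assumption made for this example.

```c
#include <xmmintrin.h>  /* SSE: _mm_shuffle_ps, _mm_movehl_ps, _mm_add_ps, _mm_add_ss */

/* Hypothetical helper: returns v0 + v1 + v2 + v3 for v = [v0, v1, v2, v3]. */
static float hsum_ps_sse1(__m128 v)
{
    /* Swap neighbours within each pair: [v1, v0, v3, v2]. */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    /* Pairwise sums: [v0+v1, v1+v0, v2+v3, v3+v2]. */
    __m128 sums = _mm_add_ps(v, shuf);
    /* Bring the high pair of sums down to the low lanes: lane 0 = v2+v3. */
    shuf = _mm_movehl_ps(shuf, sums);
    /* Scalar add in lane 0: (v0+v1) + (v2+v3). */
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}
```

In a summing loop you would keep accumulating with _mm_add_ps inside the loop and call a reduction like this once, after the loop ends.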