SSE: convert __m128 to float

I have the following piece of C code:

__m128 pSrc1 = _mm_set1_ps(4.0f);
__m128 pDest;
int i;
for (i=0;i<100;i++) {
       m1 = _mm_mul_ps(pSrc1, pSrc1);      
       m2 = _mm_mul_ps(pSrc1, pSrc1);        
       m3 = _mm_add_ps(m1, m2);             
       pDest = _mm_add_ps(m3, m3); 
}

float *arrq = (float*) pDest;

Everything until the end of the for loop works. What I am trying to do now is to cast the __m128 type back to float. Since it stores 4 floats I thought I easily can cast it back to float*. What am I doing wrong? (This is a test code, so don't wonder). I basically tried all possible conversions I could think of. Thx for your help.

Solution

You can to use _mm_store_ps to store a __m128 vector into a float array.

alignas(16) float result [4];
_mm_store_ps (result, pDest);

// If result is not 16-byte aligned, use _mm_storeu_ps
// On modern CPUs this is just as fast as _mm_store_ps if
// result is 16-byte aligned, but works in all other cases as well
_mm_storeu_ps (result, pDest);

You can then access any / all elements from that temporary array, and if you're lucky the compiler will turn this into a shuffle instead of store/reload if that's more efficient. (If the destination isn't just a temporary and you actually want all 4 elements stored somewhere, then _mm_storeu_ps or store is exactly what you want.)

If you want just the low element, float _mm_cvtss_f32(__m128) is good.

If you want to combine the vector elements down to a single float after a loop that sums an array or does a dot-product, see Fastest way to do horizontal SSE vector sum (or other reduction)