c++ simd sse avx

Set Last Value in __m128 vector register


I have a set of data that packs mixed values together, laid out like this:

{(Point_x, Point_y, Point_z, Scalar),
 (Point_x, Point_y, Point_z, Scalar),
 (Point_x, Point_y, Point_z, Scalar),
...}

where Point_x, Point_y, Point_z, and Scalar are each 32-bit floats.

Because of that layout I can do an aligned load of each point, but I need to move the point's x, y, z into its own register for my operation and then set the last value in the __m128 register (where the scalar would be) to 1.f. Which instruction sets the last value in the register to 1.f and leaves the other values untouched?

Currently I am doing:

    __m128 rPointMixed = _mm_load_ps( (float*)pPoint );
    __m128 rOne =_mm_set1_ps(1.f);
    __m128 rPoint = _mm_blend_ps(rPointMixed,rOne,0x8);

but this might not be the most efficient solution. I am fine with SSE4/AVX/AVX2 instructions if there is a really efficient way to do it with them.

struct Vec4f
{
    float x;
    float y;
    float z;
    float scalar;
};

Vec4f vData[10000];

// In reality this loop is unrolled to do 8 at a time, but it is rolled up here for simplicity's sake
for( int i = 0; i < 10000; ++i)
{
    // Take the address of the element: a Vec4f cannot be cast directly to float*
    __m128 rPointData = _mm_load_ps( (float*)&vData[i] );
    // math here that permutes the scalar and does math with it

    // 3 just because it is the last value?
    __m128 rPoint = [unknownIntrinsic](rPointData, 1.f, 3); //?

    // point math here

}

Thanks in advance for the help.


Solution

  • Unless there's an instruction with better throughput than blendps (a reciprocal throughput of about 0.33 cycles per instruction on recent Intel cores), what you're doing is already ideal.

    Note that rOne is constant, so you don't need to call _mm_set1_ps(1.f) on every iteration of the loop; your "unknown intrinsic" is really just _mm_blend_ps with rOne created once up front. Even if you leave the call inside the loop, most compilers with optimizations enabled will hoist it and do it just once before the loop.
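
For illustration, here is a minimal sketch of the loop from the question with the constant hoisted out by hand. The helper name `pointsWithUnitW`, the `SKETCH_SSE41` macro, the `alignas(16)` on `Vec4f`, and the separate output array are assumptions made for this example and do not come from the question. The macro is only there so the function can be built with GCC or Clang without passing `-msse4.1` globally.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <smmintrin.h>  // SSE4.1, provides _mm_blend_ps

// Assumption: on GCC/Clang, enable SSE4.1 for this one function instead of
// relying on a global -msse4.1 flag. Other compilers get an empty macro.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_SSE41 __attribute__((target("sse4.1")))
#else
#define SKETCH_SSE41
#endif

// Assumption: alignas(16) is added so that _mm_load_ps / _mm_store_ps are valid.
struct alignas(16) Vec4f
{
    float x;
    float y;
    float z;
    float scalar;
};

// Hypothetical helper: for each input point, writes (x, y, z, 1.0f) to out.
SKETCH_SSE41 void pointsWithUnitW(const Vec4f* in, Vec4f* out, std::size_t count)
{
    // Both arrays must be 16-byte aligned for the aligned load/store below.
    assert(reinterpret_cast<std::uintptr_t>(in) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(out) % 16 == 0);

    // Created once, outside the loop, because it never changes.
    const __m128 rOne = _mm_set1_ps(1.0f);

    for (std::size_t i = 0; i < count; ++i)
    {
        const __m128 rPointMixed = _mm_load_ps(&in[i].x);

        // Mask 0x8 = 0b1000: lane 3 comes from rOne, lanes 0..2 from rPointMixed.
        const __m128 rPoint = _mm_blend_ps(rPointMixed, rOne, 0x8);

        // The point math would go here; this sketch just stores the result.
        _mm_store_ps(&out[i].x, rPoint);
    }
}
```

Calling `pointsWithUnitW(vData, vOut, 10000)` with a second aligned array `vOut` (also hypothetical) would leave x, y, z unchanged and set the fourth lane of every output element to 1.0f. In real code the blended register would normally feed straight into the point math rather than being stored.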