c++vectorization sse simultaneous vector-multiplication

Simultaneously multiply all struct-elements with a scalar

I have a struct that represents a vector. This vector consists of two one-byte integers. I use them to keep values from 0 to 255.

typedef uint8_T unsigned char;

struct Vector
{
  uint8_T x;
  uint8_T y;
};

Now, the main use case in my program is to multiply both elements of the vector with a 32bit float value:

typedef real32_T float;

Vector Vector::operator * ( const real32_T f ) const {
  return Vector( (uint8_T)(x * f), (uint8_T)(y * f) );
};

This needs to be performed very often. Is there a way that these two multiplications can be performed simultaneously? Maybe by vectorization, SSE or similar? Or is the Visual studio compiler already doing this simultaneously?

Another usecase is to interpolate between two Vectors.

Vector Vector::interpolate(const Vector& rhs, real32_T z) const
{
  return Vector(
        (uint8_T)(x + z * (rhs.x - x)),
        (uint8_T)(y + z * (rhs.y - y))
        );
}

This already uses an optimized interpolation aproach (https://stackoverflow.com/a/4353537/871495).

But again the values of the vectors are multiplied by the same scalar value. Is there a possibility to improve the performance of these operations?

Thanks

(I am using Visual Studio 2010 with an 64bit compiler)

Solution

In my experience, Visual Studio (especially an older version like VS2010) does not do a lot of vectorization on its own. They have improved it in the newer versions, so if you can, you might see if a change of compiler speeds up your code.

Depending on the code that uses these functions and the optimization the compiler does, it may not even be the calculations that slow down your program. Function calls and cache misses may hurt a lot more.

You could try the following:

If not already done, define the functions in the header file, so the compiler can inline them
If you use these functions in a tight loop, try doing the calculations 'by hand' without any function calls (temporarily expose the variables) and see if it makes a speed difference)
If you have a lot of vectors, look at how they are laid out in memory. Store them contiguously to minimize cache misses.
For SSE to work really well, you'd have to work with 4 values at once - so multiply 2 vectors with 2 floats. In a loop, use a step of 2 and write a static function that calculates 2 vectors at once using SSE instructions. Because your vectors are not aligned (and hardly ever will be with 8 bit variables), the code could even run slower than what you have now, but it's worth a try.
If applicable and if you don't depend on the clamping that occurs with your cast from float to uint8_t (e.g. if your floats are in range [0,1]), try using float everywhere. This may allow the compiler do do far better optimization.