Why is glm::dot faster than helper_math.h's implementation and my own?


Problem

I ran into some performance issues with glm matrix-vector multiplications in CUDA, which I documented in the bug tracker (linking it because it might be useful for others).

While running performance tests on this, I found that glm's implementation of the dot product is faster than the one in CUDA's helper_math.h:

inline __host__ __device__ float dot(float4 a, float4 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

I tried to mimic glm's implementation, but the performance stayed the same:

inline __host__ __device__ float dot(float4 a, float4 b)
{
    float4 tmp = a * b;
    return (tmp.x + tmp.y) + (tmp.z + tmp.w);
}

As far as I can tell, this is glm's implementation:

template <typename T, precision P>
struct compute_dot<detail::tvec4, T, P>
{
    GLM_FUNC_QUALIFIER static T call(detail::tvec4<T, P> const & x, detail::tvec4<T, P> const & y)
    {
        detail::tvec4<T, P> tmp(x * y);
        return (tmp.x + tmp.y) + (tmp.z + tmp.w);
    }
};
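
For reference, glm::dot reaches this specialization through a generic dispatcher; in glm 0.9.5 it looks roughly like this (paraphrased from func_geometric.inl, so details may differ):

// Roughly how glm 0.9.5 dispatches glm::dot to compute_dot::call
// (paraphrased; see func_geometric.inl for the exact code).
template <typename T, precision P, template <typename, precision> class vecType>
GLM_FUNC_QUALIFIER T dot(vecType<T, P> const & x, vecType<T, P> const & y)
{
    return detail::compute_dot<vecType, T, P>::call(x, y);
}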

Edit: The following also didn't improve the performance:

inline __host__ __device__ float dot(float4 a, float4 b)
{
    a *= b;   // component-wise multiply-assign from helper_math.h
    return (a.x + a.y) + (a.z + a.w);
}

Results

The difference is quite apparent:

time for cuda glm (dot): 223 milliseconds
time for cuda helper math (dot): 307 milliseconds

Testing method

I used the following kernel for the test (and an analogous one for glm), with numElements = 2000000 and innerLoopSize = 100, measuring with std::chrono::high_resolution_clock.

__global__ void cuDotKernel(const float4 *vectors, float4 *result, int numElements, int innerLoopSize) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {
        result[i] = vectors[i];
        // Skip the border elements so that vectors[i-1] and vectors[i+1] are valid.
        if (i > 1 && i < numElements - 1) {
            for (int j = 0; j < innerLoopSize; j++) {
                result[i].y = dot(vectors[i + 1], vectors[i]);
                result[i].x = dot(vectors[i - 1], vectors[i]);
                result[i].z = dot(vectors[i + 1], result[i]);
                result[i].w = dot(vectors[i - 1], result[i]);
            }
        }
    }
}
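
The host-side measurement looks roughly like this (a minimal sketch; the block/grid sizes and the runDotBenchmark name are illustrative, not the exact code from the repo):

#include <chrono>
#include <iostream>

// Sketch of the host-side timing around one kernel variant.
void runDotBenchmark(const float4 *d_vectors, float4 *d_result,
                     int numElements, int innerLoopSize)
{
    const int blockSize = 256;
    const int gridSize  = (numElements + blockSize - 1) / blockSize;

    auto start = std::chrono::high_resolution_clock::now();

    cuDotKernel<<<gridSize, blockSize>>>(d_vectors, d_result,
                                         numElements, innerLoopSize);
    cudaDeviceSynchronize();   // wait for the kernel before stopping the clock

    auto end = std::chrono::high_resolution_clock::now();
    auto ms  = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    std::cout << "time for cuda helper math (dot): " << ms << " milliseconds\n";
}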

The full testing code is on Bitbucket (the repo is public); it can be downloaded directly (without hg).

I tested on Linux with CUDA 6.5, a GeForce GTX 550 Ti, and glm 0.9.5.4.

Question

Now the question is: why is glm's implementation faster, and how can I make helper_math.h's code as fast?


Solution

  • I changed the test data so that the vectors don't "explode", i.e. they stay within a sensible range [0.0, 1000.0]; a sketch of such an initialization is shown below. With that change the timings are very similar.

    It seems the CUDA compiler was able to optimise away some of the computations in the glm case, but not in the helper_math case.
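
A minimal sketch of an initialization along those lines (illustrative only: the function name and the use of rand() are my assumptions; the repository may generate its data differently):

#include <cstdlib>
#include <vector>

// Fill the input with values in [0.0, 1000.0] so repeated dot products
// stay finite instead of overflowing to infinity. (Sketch under the
// assumption of uniform random data; the actual test data may differ.)
std::vector<float4> makeTestVectors(int numElements)
{
    std::vector<float4> vectors(numElements);
    for (int i = 0; i < numElements; ++i) {
        vectors[i] = make_float4(
            1000.0f * rand() / RAND_MAX,
            1000.0f * rand() / RAND_MAX,
            1000.0f * rand() / RAND_MAX,
            1000.0f * rand() / RAND_MAX);
    }
    return vectors;
}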