Tags: c++, simd, glm-math

Does GLM use SIMD automatically? (and a question about glm performance)


I would like to check whether glm uses SIMD on my machine or not. CPU: 4th gen i5, OS: Arch Linux (up to date), IDE: QtCreator.

I wrote a little application to test it:

#include <iostream>
#include <chrono>
//#define GLM_FORCE_SSE2
//#define GLM_FORCE_ALIGNED
#include <glm/glm.hpp>
#include <xmmintrin.h>
float glm_dot(const glm::vec4& v1, const glm::vec4& v2)
{
   auto start = std::chrono::steady_clock::now();
   auto res = glm::dot(v1, v2);
   auto end = std::chrono::steady_clock::now();
   std::cout << "glm_dot:\t\t" << res << " elasped time: " <<    std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() << std::endl;
   return res;
}

float dot_pure(const glm::vec4& v1, const glm::vec4& v2)
{
   auto start = std::chrono::steady_clock::now();
   auto res = v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
   auto end = std::chrono::steady_clock::now();
   std::cout << "dot_pure:\t\t" << res << " elasped time: " << std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() << std::endl;
   return res;
}

float dot_simd(const float& v1, const float& v2)
{
   auto start = std::chrono::steady_clock::now();
   const __m128& v1m = reinterpret_cast<const __m128&>(v1);
   const __m128& v2m = reinterpret_cast<const __m128&>(v2);
   __m128 mul =  _mm_mul_ps(v1m, v2m);
   auto res = mul[0] + mul[1] + mul[2];
   auto end = std::chrono::steady_clock::now();
   std::cout << "dot_simd:\t\t" << res << " elasped time: " << std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() << std::endl;
   return res;
}

float dot_simd_glm_type(const glm::vec4& v1, const glm::vec4& v2)
{
   auto start = std::chrono::steady_clock::now();
   const __m128& v1m = reinterpret_cast<const __m128&>(v1);
   const __m128& v2m = reinterpret_cast<const __m128&>(v2);
   __m128 mul =  _mm_mul_ps(v1m, v2m);
   auto res = mul[0] + mul[1] + mul[2];
   auto end = std::chrono::steady_clock::now();
   std::cout << "dot_simd_glm_type:\t" << res << " elasped time: " << std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() << std::endl;
   return res;
}

int main()
{
   glm::vec4 v1 = {1.1f, 2.2f, 3.3f, 0.0f};
   glm::vec4 v2 = {3.0f, 4.0f, 5.0f, 0.0f};
   float v1_raw[] = {1.1f, 2.2f, 3.3f, 0.0f};
   float v2_raw[] = {3.0f, 4.0f, 5.0f, 0.0f};
   glm_dot(v1, v2);
   dot_pure(v1, v2);
   dot_simd(*v1_raw, *v2_raw);
   dot_simd_glm_type(v1, v2);
   return 0;
}

glm_dot() calls glm::dot; the other functions are my own implementations. When I run it in Debug mode, a typical result is:

glm_dot:            28.6 elapsed time: 487
dot_pure:           28.6 elapsed time: 278
dot_simd:           28.6 elapsed time: 57
dot_simd_glm_type:  28.6 elapsed time: 52

glm::dot calls compute_dot::call from func_geometric.inl, which is a “pure” (non-SIMD) implementation of the dot function. I don’t understand why glm::dot usually takes more time than my dot_pure() implementation, but this is Debug mode, so let’s move on to Release:

glm_dot:            28.6 elapsed time: 116
dot_pure:           28.6 elapsed time: 53
dot_simd:           28.6 elapsed time: 54
dot_simd_glm_type:  28.6 elapsed time: 54

Not always, but usually my pure implementation takes less time than the SIMD version. Maybe this is because the compiler can use SIMD in my pure implementation too; I don’t know.

  1. However, the glm::dot call is typically much slower than the other three implementations. Why? Maybe glm uses the pure implementation this time, too? When I use ReleaseWithDebugInfos, this seems to be the case.

If I uncomment the two defines in the source code (to force SIMD), I get better results, but usually the glm::dot call is still slower. (Debugging in ReleaseWithDebugInfos doesn’t show anything this time.)

glm_dot:            28.6 elapsed time: 88
dot_pure:           28.6 elapsed time: 63
dot_simd:           28.6 elapsed time: 53
dot_simd_glm_type:  28.6 elapsed time: 53
  2. Shouldn’t glm use SIMD by default whenever possible? According to the documentation, however, it may not be automatic at all: “GLM provides some SIMD optimizations based on compiler intrinsics. These optimizations will be automatically thanks to compiler arguments. For example, if a program is compiled with Visual Studio using /arch:AVX, GLM will detect this argument and generate code using AVX instructions automatically when available.” (source: https://chromium.googlesource.com/external/github.com/g-truc/glm/+/0.9.9-a2/manual.md)

  3. There is a glm test called test-core_setup_message; if I run it, it seems glm does not detect my arch (which would mean SSE, SSE2, etc.):

$ ./test-core_setup_message
__cplusplus: 201703
GCC 8
GLM_MODEL_64
GLM_ARCH: 

So to summarize my question: does glm use SIMD instructions automatically or not? Some parts of the documentation say it is automatic, others say it depends on the compiler flags. When I force the usage of SSE2, why is it still slower than my SIMD call?


Solution

  • If I uncomment the two defines in the source code (to force SIMD), I get better results, but usually the glm::dot call is still slower. (Debugging in ReleaseWithDebugInfos doesn’t show anything this time.)

    Your test is not very rigorous, and is prone to running into memory caching artifacts.

    Case in point: just shuffling the order of the tests around (compiling with -O3 -march=x86-64 -mavx2 and your defines unset), I got:

    dot_simd:           28.6 elapsed time: 170
    dot_pure:           28.6 elapsed time: 54
    dot_simd_glm_type:  28.6 elapsed time: 46
    glm_dot:            28.6 elapsed time: 47
    

    You need to run these kinds of tests with a benchmarking library, such as Google Benchmark.
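
    As a rough illustration, a minimal sketch of the glm::dot case with Google Benchmark could look like the following (the benchmark name and setup are my own assumption of how one might structure it, not code from the question):

    #include <benchmark/benchmark.h>
    #include <glm/glm.hpp>

    static void BM_glm_dot(benchmark::State& state)
    {
        glm::vec4 v1{1.1f, 2.2f, 3.3f, 0.0f};
        glm::vec4 v2{3.0f, 4.0f, 5.0f, 0.0f};
        for (auto _ : state)
        {
            float res = glm::dot(v1, v2);
            benchmark::DoNotOptimize(res); // keep the result live so the call is not optimized away
        }
    }
    BENCHMARK(BM_glm_dot);

    BENCHMARK_MAIN();

    The library repeats the measurement until the timings are statistically stable, which a single chrono measurement around one call cannot do.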

    But even then, "runs faster" is only a rough proxy for "uses SIMD". You are far better off actually looking at the resulting assembly.
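
    For example, stripped of the timing code, two of the functions from the question reduce to roughly this (a sketch reconstructed from the question's code; the exact source behind the listing below may differ slightly):

    #include <glm/glm.hpp>
    #include <xmmintrin.h>

    float glm_dot(const glm::vec4& v1, const glm::vec4& v2)
    {
        return glm::dot(v1, v2);
    }

    float dot_simd(const float& v1, const float& v2)
    {
        const __m128& v1m = reinterpret_cast<const __m128&>(v1);
        const __m128& v2m = reinterpret_cast<const __m128&>(v2);
        __m128 mul = _mm_mul_ps(v1m, v2m);
        return mul[0] + mul[1] + mul[2];
    }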

    I removed the timing code from your examples and obtained the following (see it on godbolt):

    glm_dot(glm::vec<4, float, (glm::qualifier)0> const&, glm::vec<4, float, (glm::qualifier)0> const&):
            vmovss  xmm0, DWORD PTR [rdi+4]
            vmovss  xmm1, DWORD PTR [rdi]
            vmulss  xmm0, xmm0, DWORD PTR [rsi+4]
            vmovss  xmm2, DWORD PTR [rdi+8]
            vmulss  xmm1, xmm1, DWORD PTR [rsi]
            vmulss  xmm2, xmm2, DWORD PTR [rsi+8]
            vaddss  xmm0, xmm0, xmm1
            vmovss  xmm1, DWORD PTR [rdi+12]
            vmulss  xmm1, xmm1, DWORD PTR [rsi+12]
            vaddss  xmm1, xmm1, xmm2
            vaddss  xmm0, xmm0, xmm1
            ret
    dot_simd(float const&, float const&):
            vmovaps xmm1, XMMWORD PTR [rsi]
            vmulps  xmm1, xmm1, XMMWORD PTR [rdi]
            vshufps xmm2, xmm1, xmm1, 85
            vaddss  xmm0, xmm1, xmm2
            vunpckhps       xmm1, xmm1, xmm1
            vaddss  xmm0, xmm0, xmm1
            ret
    

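    The scalar code generated for glm_dot is consistent with GLM dispatching to the non-SIMD compute_dot specialization the question mentions; paraphrased from func_geometric.inl (my paraphrase, not the verbatim GLM source), that fallback does roughly:

    #include <glm/glm.hpp>

    // Paraphrase of GLM's scalar vec4 dot-product fallback (not verbatim).
    template<typename T, glm::qualifier Q>
    T scalar_dot(glm::vec<4, T, Q> const& a, glm::vec<4, T, Q> const& b)
    {
        glm::vec<4, T, Q> tmp(a * b);              // component-wise multiply
        return (tmp.x + tmp.y) + (tmp.z + tmp.w); // scalar horizontal add
    }
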
    So you are correct that SIMD is evidently not used by default.
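
    If you want to see whether the two force macros from your snippet change this, the same assembly inspection can be repeated on a minimal translation unit such as the one below (a sketch using the macro names from the question; whether SIMD code is actually emitted depends on the GLM version and compiler flags):

    #define GLM_FORCE_SSE2     // request SSE2 code paths (macro from the question's snippet)
    #define GLM_FORCE_ALIGNED  // request aligned vector types so packed loads are possible
    #include <glm/glm.hpp>

    float glm_dot_forced(const glm::vec4& v1, const glm::vec4& v2)
    {
        return glm::dot(v1, v2);
    }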