Tags: c++, intel, sse, intrinsics, sse2

Is there a difference between SVML vs. normal intrinsic square root functions?


Is there any difference in precision or performance between the normal sqrtps/sqrtpd intrinsics and the SVML versions:

     __m128d _mm_sqrt_pd (__m128d a) [SSE2]
     __m128d _mm_svml_sqrt_pd (__m128d a) [SSE?]
     __m128 _mm_sqrt_ps (__m128 a) [SSE]
     __m128 _mm_svml_sqrt_ps (__m128 a) [SSE?]

I know that SVML intrinsics like _mm_sin_ps are actually functions that may consist of multiple asm instructions, so they should be slower than any single multiply or even a divide. However, I'm curious why these functions exist when hardware-level intrinsics are available.

Were these SVML functions created before SSE2? Or is there a difference in precision?
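For what it's worth, here is a minimal sketch I could run to compare the two directly, assuming a toolchain that actually ships SVML (e.g. MSVC or the Intel compiler); _mm_svml_sqrt_pd is a library function rather than a single instruction, so GCC/Clang without SVML will not build this:

     #include <immintrin.h>
     #include <cstdio>

     int main() {
         __m128d a = _mm_set_pd(2.0, 12345.6789);

         __m128d hw   = _mm_sqrt_pd(a);       // sqrtpd instruction
         __m128d svml = _mm_svml_sqrt_pd(a);  // SVML library call

         double h[2], s[2];
         _mm_storeu_pd(h, hw);
         _mm_storeu_pd(s, svml);

         // Any precision difference would show up as a nonzero diff.
         for (int i = 0; i < 2; ++i)
             printf("hw=%.17g  svml=%.17g  diff=%g\n", h[i], s[i], h[i] - s[i]);
         return 0;
     }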


Solution

  • I've inspected the code gen in MSVC.

    • _mm_svml_sqrt_pd compiles into a function call; the called function consists of a single sqrtpd followed by ret
    • _mm_svml_sqrt_ps compiles into a function call; the called function consists of a single sqrtps followed by ret
    • _mm_sqrt_pd and _mm_sqrt_ps intrinsics compile to inlined sqrtpd and sqrtps

    A possible explanation (just a guess): SVML is meant to support CPU dispatch, but the version shipped with MSVC has that dispatch disabled. The dispatch may exist to provide a different implementation for Xeon Phi, and the Xeon Phi version may simply not be included in the MSVC build of SVML.


    (Screenshot of the MSVC disassembly not reproduced here.)


    When using the Intel compiler, the call goes through svml_dispmd.dll, and there is an actual dispatch function (a real indirect jump, ff 25 42 08 00 00), which ends up in vsqrtpd on my machine.
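
    A pair of tiny wrappers (just a sketch) is enough to reproduce this inspection: compile and look at the disassembly. _mm_sqrt_pd should inline to a single sqrtpd, while _mm_svml_sqrt_pd should become a call — a direct call into the SVML stub under MSVC, an indirect jump through the svml_dispmd.dll dispatcher under the Intel compiler.

         #include <immintrin.h>

         // Expected to inline to a single sqrtpd instruction.
         __m128d sqrt_hw(__m128d a)   { return _mm_sqrt_pd(a); }

         // Expected to compile to a call into the SVML library
         // (direct stub under MSVC, indirect dispatch under ICC).
         __m128d sqrt_svml(__m128d a) { return _mm_svml_sqrt_pd(a); }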