Is there any difference in precision or performance between the normal sqrtps/sqrtpd intrinsics and the SVML versions:
__m128d _mm_sqrt_pd (__m128d a) [SSE2]
__m128d _mm_svml_sqrt_pd (__m128d a) [SSE?]
__m128 _mm_sqrt_ps (__m128 a) [SSE]
__m128 _mm_svml_sqrt_ps (__m128 a) [SSE?]
I know that SVML intrinsics like _mm_sin_ps are actually functions consisting of potentially multiple asm instructions, so they should be slower than any single multiply or even divide. However, I'm curious why these functions exist if hardware-level intrinsics are available.
Were these SVML functions created before SSE2? Or is there a difference in precision?
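For reference, a minimal sketch like the one below (my own, not taken from any documentation) could be used to check whether the two intrinsics give bit-identical results. It assumes a compiler that ships SVML (MSVC or the Intel compiler), where the SVML intrinsics are declared in immintrin.h.

/* Sketch: compare _mm_sqrt_pd and _mm_svml_sqrt_pd bit-for-bit.
 * Assumes MSVC or ICC, which declare the SVML intrinsics in <immintrin.h>. */
#include <immintrin.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const double in[2] = { 2.0, 12345.6789 };
    __m128d a = _mm_loadu_pd(in);

    __m128d hw   = _mm_sqrt_pd(a);        /* hardware sqrtpd */
    __m128d svml = _mm_svml_sqrt_pd(a);   /* SVML version */

    double r_hw[2], r_svml[2];
    _mm_storeu_pd(r_hw, hw);
    _mm_storeu_pd(r_svml, svml);

    /* Compare raw bit patterns so any rounding difference would show up. */
    printf("bit-identical: %s\n",
           memcmp(r_hw, r_svml, sizeof r_hw) == 0 ? "yes" : "no");
    return 0;
}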
I've inspected the code gen in MSVC.
_mm_svml_sqrt_pd compiles into a function call; the called function consists of a single sqrtpd followed by ret. Likewise, _mm_svml_sqrt_ps compiles into a function call whose body is a single sqrtps followed by ret.
The _mm_sqrt_pd and _mm_sqrt_ps intrinsics compile to an inlined sqrtpd and sqrtps, respectively.
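A hypothetical pair of test functions like the following (names are mine, not from the original answer) makes the difference easy to see in the disassembly; compile with MSVC, e.g. cl /O2 /FA, and inspect the listing.

/* Test functions for inspecting MSVC code gen (illustrative sketch). */
#include <immintrin.h>

__m128d sqrt_hw(__m128d a)
{
    return _mm_sqrt_pd(a);        /* expected: inlined sqrtpd */
}

__m128d sqrt_svml(__m128d a)
{
    return _mm_svml_sqrt_pd(a);   /* expected: call into the SVML stub */
}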
A possible explanation (just a guess): SVML was intended to have CPU dispatch, but the version shipped with MSVC has that dispatch disabled. The goal may have been to implement it differently for Xeon Phi, and the Xeon Phi version may simply not be included in the MSVC build of SVML.
When using the Intel compiler, the calls go through svml_dispmd.dll, and there is an actual dispatch function (a real indirect jump, ff 25 42 08 00 00), which ends up in vsqrtpd for me.