I'm trying to write a C++ program that calls a function I wrote in x64 assembly. I'd like to speed things up a little (and play with CPU features), so I chose to use vector operations.
The problem is, I have to multiply sines by an integer, so I have to calculate the sines first.
Is it possible to do this in SSE/AVX? I'm aware of the fsin instruction, but not only is it an x87 FPU instruction, it also computes only one sine at a time. So I'd have to push the value onto the FPU stack, call fsin, pop the result to memory, and then load it into an AVX register. It seems to me it's not worth the hassle.
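Yes, there is a vector version using SSE/AVX! The catch is that you need a compiler that provides these intrinsics, such as the Intel C++ compiler (newer versions of MSVC also ship them; GCC and Clang do not).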
This is Intel's Short Vector Math Library (SVML), exposed as intrinsics:
For 128-bit SSE (double precision): _mm_sin_pd
For 256-bit AVX (double precision): _mm256_sin_pd
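Here is a minimal sketch of how the AVX intrinsic could be applied to the task from the question (multiplying each sine by an integer), assuming an Intel compiler that provides SVML and has AVX code generation enabled. The function name scaled_sines and the macro USE_SVML_AVX are hypothetical. On any other compiler the preprocessor guard skips the SVML path and the plain std::sin loop handles the whole array; with SVML it only handles the leftover elements when n is not a multiple of 4.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical feature macro: take the SVML path only when compiling
// with an Intel compiler and AVX enabled (assumption: that compiler
// provides _mm256_sin_pd and links SVML automatically).
#if (defined(__INTEL_COMPILER) || defined(__INTEL_LLVM_COMPILER)) && defined(__AVX__)
#include <immintrin.h>
#define USE_SVML_AVX 1
#endif

// Computes out[i] = k * sin(in[i]) for i in [0, n).
void scaled_sines(const double* in, double* out, std::size_t n, int k) {
    assert(n == 0 || (in != nullptr && out != nullptr));
    std::size_t i = 0;
#ifdef USE_SVML_AVX
    const __m256d vk = _mm256_set1_pd(static_cast<double>(k));  // k in all 4 lanes
    for (; i + 4 <= n; i += 4) {
        const __m256d x = _mm256_loadu_pd(in + i);   // load 4 doubles (unaligned OK)
        const __m256d s = _mm256_sin_pd(x);          // SVML: 4 sines at once
        _mm256_storeu_pd(out + i, _mm256_mul_pd(s, vk));
    }
#endif
    // Scalar tail, or the whole array when SVML is unavailable.
    for (; i < n; ++i)
        out[i] = k * std::sin(in[i]);
}
```

Keeping a scalar loop after the vector loop is the usual way to deal with array lengths that are not a multiple of the vector width, and it also gives a portable fallback.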
The two intrinsics are actually small library functions made of hand-written SSE/AVX assembly, so with AVX you can compute 4 double-precision sines at once (2 with SSE) :) If I remember correctly, the latency is about 10 clock cycles on a Haswell CPU.
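SVML's own source isn't shown here, but as a rough illustration of how a sine can be built from operations that vectorize, below is a generic "range reduction + polynomial" sketch. It is not Intel's algorithm: it uses truncated Taylor coefficients for simplicity (real libraries use minimax polynomials and more careful range reduction), and it is only meant for moderate |x|. The name sin_poly is hypothetical. Every step is a divide, round, multiply, add, or sign flip, which is what lets the same computation run across the 4 lanes of an AVX register.

```cpp
#include <cassert>
#include <cmath>

// Illustrative sine: NOT the SVML implementation.
// Reduce x = q*pi + r with r in [-pi/2, pi/2], so sin(x) = (-1)^q * sin(r),
// then evaluate a degree-15 Taylor polynomial for sin(r) with Horner's rule.
double sin_poly(double x) {
    // Reducing with a single double constant for pi loses accuracy for large |x|.
    assert(std::fabs(x) < 1e6 && "simple range reduction is only for moderate |x|");
    const double pi = 3.14159265358979323846;
    const double q = std::nearbyint(x / pi);  // nearest integer multiple of pi
    const double r = x - q * pi;              // |r| <= pi/2
    const double r2 = r * r;
    double p = -1.0 / 1307674368000.0;        // -1/15!
    p = p * r2 + 1.0 / 6227020800.0;          // +1/13!
    p = p * r2 - 1.0 / 39916800.0;            // -1/11!
    p = p * r2 + 1.0 / 362880.0;              // +1/9!
    p = p * r2 - 1.0 / 5040.0;                // -1/7!
    p = p * r2 + 1.0 / 120.0;                 // +1/5!
    p = p * r2 - 1.0 / 6.0;                   // -1/3!
    p = p * r2 + 1.0;
    const double s = r * p;                   // sin(r)
    // Odd multiples of pi flip the sign.
    return (static_cast<long long>(q) & 1) ? -s : s;
}
```

With AVX, the same sequence on a __m256d would roughly use _mm256_round_pd for the reduction, _mm256_mul_pd and _mm256_add_pd for Horner's rule, and _mm256_xor_pd on the sign bit for the final flip, processing 4 inputs per instruction.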
By the way, the CPU needs to execute about 100 such intrinsic calls to warm up and reach its peak performance, so if only a few sines need to be evaluated, it's better to use plain sin() instead.
Good luck!!