What is the best/fastest way to calculate x % M using vector instructions on x64/sse? (By % I mean mod/remainder).
I couldn't find any opcode for packed mod, so I think the best I could do is promote int to float, then calculate x - m * floor(x / m) using DIVPS and ROUNDPS.
Or is there a better alternative that I'm missing?
UPDATE: M is only known at runtime, the actual loop looks like this:
unsigned x[SIZE], M[SIZE], answer[SIZE];
for (int i = 0; i < SIZE; i++) {
    answer[i] = x[i] % M[i];
}
Also M is known to be in the range 1 - 640000000, if it helps in any way.
If M is either a compile-time constant or is constant within a loop, then instead of using division you can calculate a reciprocal and then do a multiplication and a shift. We can write
x/M = (x*(2^n/M))>>n
The factor 2^n/M (aka the magic number) should be calculated before the loop or at compile time.
For example, if we want x[i]/5 and we know that x[i] is less than 2^15, we can use 2^n/M = 0xCCCD and n = 18.
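#include <stdio.h>
#define N 32768   /* x[i] covers 0 .. 2^15 - 1 */
int x[N], y[N], z[N];
int main(void) {
    for (int i = 0; i < N; i++) x[i] = i;
    int M = 5;
    int fact = 0xCCCD;   /* magic number: 2^18 / 5, rounded up */
    int n = 18;
    for (int i = 0; i < N; i++) {
        y[i] = x[i] / M;             /* reference: ordinary division */
        z[i] = (fact * x[i]) >> n;   /* multiply and shift */
        /* prints only on a mismatch, so no output means the two agree */
        if (y[i] != z[i]) printf("%d %d\n", y[i], z[i]);
    }
}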
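Since the question asks for the remainder rather than the quotient, here is a minimal sketch of the extra step: once the quotient q = x/M is known, the remainder is x - q*M. The function names (div5_u15, mod5_u15, mod5_array) are made up for illustration, and the code assumes the same limits as the example above (M = 5, x < 2^15).

```c
#include <stdint.h>

/* Quotient x / 5 via multiply-and-shift.
   Assumption: 0 <= x < 2^15, as in the example above (0xCCCD, n = 18). */
static uint32_t div5_u15(uint32_t x) {
    return (x * 0xCCCDu) >> 18;   /* product stays below 2^31, no overflow */
}

/* Remainder x % 5, recovered from the quotient without a division. */
static uint32_t mod5_u15(uint32_t x) {
    return x - div5_u15(x) * 5u;
}

/* Hypothetical helper: apply the remainder to a whole array.
   The loop body contains only a multiply, a shift, a multiply and a
   subtract, which are all operations that exist in packed form. */
static void mod5_array(const uint32_t *x, uint32_t *out, int count) {
    for (int i = 0; i < count; i++)
        out[i] = mod5_u15(x[i]);
}
```

Because the loop body no longer contains a division, a compiler may be able to auto-vectorize it, or it can be translated to intrinsics by hand.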
There are several different methods to determine the magic number and n. I use Agner Fog's Vector Class Library (VCL). It does this for you using SSE2 or AVX2 for 32-bit numbers (instead of the 15-bit numbers in the code above). If you want to see assembly code that does this, his assembly library also does it for SSE2 (and maybe AVX2).
See page 22 of the VCL manual for more details. It's also described in the manual for his assembly library.
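To give an idea of what determining the magic number at runtime can look like, here is a rough sketch of one of those methods, a round-up construction: take l = ceil(log2(M)), n = 15 + l, and the magic number ceil(2^n / M). This is not taken from VCL and is not necessarily what VCL does. The type and function names (magic15_t, magic15_init, magic15_div, magic15_mod) are invented for this example, and it assumes x < 2^15 and 1 <= M < 2^16, so it does not cover the full 32-bit range from the question.

```c
#include <stdint.h>

/* Hypothetical container for the precomputed constants. */
typedef struct {
    uint32_t magic;   /* ceil(2^shift / d) */
    unsigned shift;   /* n = 15 + ceil(log2(d)) */
} magic15_t;

/* Compute the constants once, before the loop.
   Assumption: 1 <= d < 2^16, and the dividends will be below 2^15. */
static magic15_t magic15_init(uint32_t d) {
    unsigned l = 0;
    while ((1u << l) < d) l++;            /* l = ceil(log2(d)) */
    magic15_t m;
    m.shift = 15 + l;
    /* round up so that magic * d >= 2^shift */
    m.magic = (uint32_t)((((uint64_t)1 << m.shift) + d - 1) / d);
    return m;
}

/* x / d for 0 <= x < 2^15, using only a multiply and a shift. */
static uint32_t magic15_div(uint32_t x, magic15_t m) {
    return (uint32_t)(((uint64_t)x * m.magic) >> m.shift);
}

/* x % d, recovered from the quotient. */
static uint32_t magic15_mod(uint32_t x, uint32_t d, magic15_t m) {
    return x - magic15_div(x, m) * d;
}
```

For M = 5 this gives l = 3, n = 18 and a magic number of 0xCCCD, the same constants as in the example above. For full 32-bit operands the magic number can need more than 32 bits, which is one reason to let a library such as VCL handle it.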