What are the fastest divisibility tests? Say, given a little-endian architecture and a 32-bit signed integer: how to calculate very fast that a number is divisible by 2,3,4,5,... up to 16?
WARNING: given code is EXAMPLE only. Every line is independent! Just obvious solution using modulo operation is slow on many processors, which don't have DIV hardware (like many ARMs). Some compilers are also cannot make such optimizations (say, if divisor is a function's argument or is dependent on something).
Divisible_by_1 = do();
Divisible_by_2 = if (!(number & 1)) do();
Divisible_by_3 = ?
Divisible_by_4 = ?
Divisible_by_5 = ?
Divisible_by_6 = ?
Divisible_by_7 = ?
Divisible_by_8 = ?
Divisible_by_9 = ?
Divisible_by_10 = ?
Divisible_by_11 = ?
Divisible_by_12 = ?
Divisible_by_13 = ?
Divisible_by_14 = ?
Divisible_by_15 = ?
Divisible_by_16 = if(!number & 0x0000000F) do();
and special cases:
Divisible_by_2k = if(number & (tk-1)) do(); //tk=2**k=(2*2*2*...) k times
It is not a bad idea AT ALL to figure out alternatives to division instructions (which includes modulo on x86/x64) because they are very slow. Slower (or even much slower) than most people realize. Those suggesting "% n" where n is a variable are giving foolish advice because it will invariably lead to the use of the division instruction. On the other hand "% c" (where c is a constant) will allow the compiler to determine the best algorithm available in its repertoire. Sometimes it will be the division instruction but a lot of the time it won't.
In this document Torbjörn Granlund shows that the ratio of clock cycles required for unsigned 32-bit mults:divs is 4:26 (6.5x) on Sandybridge and 3:45 (15x) on K10. for 64-bit the respective ratios are 4:92 (23x) and 5:77 (14.4x).
The "L" columns denote latency. "T" columns denote throughput. This has to do with the processor's ability to handle multiple instructions in parallell. Sandybridge can issue one 32-bit multiplication every other cycle or one 64-bit every cycle. For K10 the corresponding throughput is reversed. For divisions the K10 needs to complete the entire sequence before it may begin another. I suspect it is the same for Sandybridge.
Using the K10 as an example it means that during the cycles required for a 32-bit division (45) the same number (45) of multiplications can be issued and the next-to-last and last one of these will complete one and two clock cycles after the division has completed. A LOT of work can be performed in 45 multiplications.
It is also interesting to note that divs have become less efficient with the evolution from K8-K9 to K10: from 39 to 45 and 71 to 77 clock cycles for 32- and 64-bit.
Granlund's page at gmplib.org and at the Royal Institute of Technology in Stockholm contain more goodies, some of which have been incorporated into the gcc compiler.