I need to solve polynomials in finite prime fields on 16-bit CPUs. I have seen people using the fields GF((2^16)+1), GF((2^16)-15)
, and GF((2^32)-5)
. I guess these choices stem from the fact that there are several optimizations. However, apart from using "mod" I do not know any further optimizations. I would really appreciate if someone pointed me to a good resource, gave me code snippets, or explained why people are using those strange prime numbers instead of say GF((2^16)-1)
.
EDIT: %-free modulo in GF((2^16)+1):
uint32_t mod_0x10001(uint32_t divident)
{
uint16_t least;
uint16_t most;
least = divident & 0xFFFF;
most = divident >> 16;
if (least >= most) {
return least - most;
} else {
return 0x10001 + least - most;
}
}
EDIT: %-free modulo in GF(2^16-15):
uint32_t mod_0xFFF1(uint32_t divident)
{
uint16_t least;
uint16_t most;
uint32_t remainder;
least = divident & 0xFFFF;
most = divident >> 16;
/**
* divident mod 2^16-15
* = (most * 2^N + least) mod 2^16-15
* = [(most * 2^N mod 2^16-15) + (least mod 2^16-15)] mod 2^16-15
* = [ 15 * most + least ] mod 2^16-15
*/
remainder = 15 * most + least;
while (remainder >= 0xFFF1) {
remainder -= 0xFFF1;
}
return remainder;
}
UPDATE: I measured the performonce on an MSP430: the upper version is 4 times faster than the lower version. The lower version is as fast as simply using % :/. Any further suggestions to speed up the lower version?
Cheers Konrad
The reason for using powers 2^N - m, where is small, is due to the fact, that calculating the modulo of the word of format (HI * 2^N + LO) mod 2^N-m can be split to two (or more pieces) that reduce to
(HI*2^N+LO) mod (2^N-m) ==
((HI*2^N) mod (2^N-m) + LO mod (2^N-m)) mod (2^N-m)
(m * HI + LO ) mod (2^N-m).
The value m*HI + LO has at most log2(m) bits more than fits the computer word -- that log2(m) bit value can be again folded back to the sum by repeatedly multiplied with m and accumulated. Typically one iteration is enough.
If m is small, m^2 or m^3 can be reasonably small too -- then one can apply the technique to calculate modulus of a big-num:
[AAAAA | BBBBB | CCCCC | DDDDD | EEEEE ] mod (2^N-m) ==
EEEEE * 1 mod (2^N-m) +
DDDDD * m mod (2^N-m) +
CCCCC * (m^2) mod (2^N-m) + ... etc.
It's the same in base 10, where
1234 5678 9812 mod 9997 ==
9812 mod 9997 +
3*5678 mod 9997 +
9*1234 mod 9997 ==
3 7952 mod 9997 == ( 7952 + 3*3 ) mod 9997 = 7961
Here 9997 doesn't have to prime, we are using 10^4 instead of 2^N and m = 3
For GF(2^n) calculation there typical speedups are look-up-tables for root^n and log(n); then multiplication is reduced to addition. If the target system wasn't some 16-bit system, I'd propose using SSE4.2 (or Neon) polynomial (carry free) multiplication. If I'm not hugely mistaken, polynomial calculation in GF should be doable with convolution:
for (i=0;i<N*2-1;i++) temp[i]=popcount(A & (bit_reverse(B)<<N)>>i);
// A = 11010, B=01101, reverse B = 10110
//
// 11010 11010 11010 11010 11010 11010 11010 11010 11010
// 10110 10110 10110 10110 10110 10110 10110 10110 10110
// 0000000000 00010000 0000000 010100 10010 001000 0011000 00010000 000000000
// 0 1 0 0 0 1 0 1 0
// 010001010 to be divided by the generator polynomial using typical CRC approach
Further reading for comparison of GF(2^n) multiplication:
(paper by Serdar S. Erdem, Tuğrul Yanık, Çetin K. Koç,
Acta Applicandae Mathematica
September 2006, Volume 93, Issue 1-3, pp 33-55 )