C Snippets for Algebra in Finite Prime Fields

I need to solve polynomials in finite prime fields on 16-bit CPUs. I have seen people using the fields GF((2^16)+1), GF((2^16)-15), and GF((2^32)-5). I guess these choices stem from the fact that there are several optimizations. However, apart from using "mod" I do not know any further optimizations. I would really appreciate if someone pointed me to a good resource, gave me code snippets, or explained why people are using those strange prime numbers instead of say GF((2^16)-1).

EDIT: %-free modulo in GF((2^16)+1):

uint32_t mod_0x10001(uint32_t divident)
{
  uint16_t least;
  uint16_t most;

  least = divident & 0xFFFF;
  most = divident >> 16;

  if (least >= most) {
    return least - most;
  } else {
    return 0x10001 + least - most;
  }
}

EDIT: %-free modulo in GF(2^16-15):

uint32_t mod_0xFFF1(uint32_t divident)
{
  uint16_t least;
  uint16_t most;
  uint32_t remainder;

  least = divident & 0xFFFF;
  most = divident >> 16;

  /**
   * divident mod 2^16-15
   * = (most * 2^N + least) mod 2^16-15
   * = [(most * 2^N mod 2^16-15) + (least mod 2^16-15)] mod 2^16-15
   * = [ 15 * most               + least              ] mod 2^16-15
   */
  remainder = 15 * most         + least;

  while (remainder >= 0xFFF1) {
      remainder -= 0xFFF1;
  }

  return remainder;
}

UPDATE: I measured the performonce on an MSP430: the upper version is 4 times faster than the lower version. The lower version is as fast as simply using % :/. Any further suggestions to speed up the lower version?

Cheers Konrad

Solution

The reason for using powers 2^N - m, where is small, is due to the fact, that calculating the modulo of the word of format (HI * 2^N + LO) mod 2^N-m can be split to two (or more pieces) that reduce to

    (HI*2^N+LO) mod (2^N-m) ==
    ((HI*2^N) mod (2^N-m) + LO mod (2^N-m)) mod (2^N-m)
    (m * HI  + LO ) mod (2^N-m).

The value m*HI + LO has at most log2(m) bits more than fits the computer word -- that log2(m) bit value can be again folded back to the sum by repeatedly multiplied with m and accumulated. Typically one iteration is enough.

If m is small, m^2 or m^3 can be reasonably small too -- then one can apply the technique to calculate modulus of a big-num:

    [AAAAA | BBBBB | CCCCC | DDDDD | EEEEE ] mod (2^N-m) ==
     EEEEE * 1 mod (2^N-m) +
     DDDDD * m mod (2^N-m) +
     CCCCC * (m^2) mod (2^N-m) + ... etc.

It's the same in base 10, where

    1234 5678 9812 mod 9997 ==
              9812 mod 9997 +
            3*5678 mod 9997 +
            9*1234 mod 9997 ==
            3 7952 mod 9997 == ( 7952 + 3*3 ) mod 9997 = 7961

    Here 9997 doesn't have to prime, we are using 10^4 instead of 2^N and m = 3

For GF(2^n) calculation there typical speedups are look-up-tables for root^n and log(n); then multiplication is reduced to addition. If the target system wasn't some 16-bit system, I'd propose using SSE4.2 (or Neon) polynomial (carry free) multiplication. If I'm not hugely mistaken, polynomial calculation in GF should be doable with convolution:

for (i=0;i<N*2-1;i++) temp[i]=popcount(A & (bit_reverse(B)<<N)>>i);
//  A = 11010, B=01101, reverse B = 10110
//
//  11010     11010     11010    11010   11010  11010   11010    11010     11010
//      10110    10110    10110   10110  10110 10110  10110   10110    10110
// 0000000000 00010000  0000000  010100  10010 001000 0011000 00010000 000000000
//      0        1         0      0       0      1      0        1        0
// 010001010 to be divided by the generator polynomial using typical CRC approach

Further reading for comparison of GF(2^n) multiplication:

(paper by Serdar S. Erdem, Tuğrul Yanık, Çetin K. Koç,
Acta Applicandae Mathematica September 2006, Volume 93, Issue 1-3, pp 33-55 )