Addition and multiplication in a Galois Field

I am attempting to generate QR codes on an extremely limited embedded platform. Everything in the specification seems fairly straightforward except for generating the error correction codewords. I have looked at a bunch of existing implementations, and they all try to implement a bunch of polynomial math that goes straight over my head, particularly with regards to the Galois fields. The most straightforward way I can see, both in mathematical complexity and in memory requirements is a circuit concept that is laid out in the spec itself:

circuit diagram

With their description, I am fairly confident I could implement this with the exception of the parts labeled GF(256) addition and GF(256) Multiplication.

They offer this help:

The polynomial arithmetic for QR Code shall be calculated using bit-wise modulo 2 arithmetic and byte-wise modulo 100011101 arithmetic. This is a Galois field of 2^8 with 100011101 representing the field's prime modulus polynomial x^8+x^4+x^3+x^2+1.

which is all pretty much greek to me.

So my question is this: What is the easiest way to perform addition and multiplication in this kind of Galois field arithmetic? Assume both input numbers are 8 bits wide, and my output needs to be 8 bits wide also. Several implementations precalculate, or hardcode in two lookup tables to help with this, but I am not sure how those are calculated, or how I would use them in this situation. I would rather not take the 512 byte memory hit for the two tables, but it really depends on what the alternative is. I really just need help understanding how to do a single multiplication and addition operation in this circuit.

Solution

In practice only one table is needed. That would be for the GP(256) multiply. Note that all arithmetic is carry-less, meaning that there is no carry-propagation.

Addition and subtraction without carry is equivalent to an xor.

So in GF(256), a + b and a - b are both equivalent to a xor b.

GF(256) multiplication is also carry-less, and can be done using carry-less multiplication in a similar way with carry-less addition/subtraction. This can be done efficiently with hardware support via say Intel's CLMUL instruction set.

However, the hard part, is reducing the modulo 100011101. In normal integer division, you do it using a series of compare/subtract steps. In GF(256), you do it in a nearly identical manner using a series of compare/xor steps.

In fact, it's bad enough where it's still faster to just precompute all 256 x 256 multiplies and put them into a 65536-entry look-up table.

page 3 of the following pdf has a pretty good reference on GF256 arithmetic:

http://www.eecs.harvard.edu/~michaelm/CS222/eccnotes.pdf