
Rijndael, Mixcolumns algorithm for numbers less than 0x1B, wikipedia C program wrong?


I am starting to implement the AES 128-bit algorithm on an FPGA. While gathering information I found a C implementation on Wikipedia and wanted to check how the MixColumns step works, but I think it is not calculating the mod operation in the right way. The program below performs b[c] ^= 0x1b & h;, and the variable h is set to 0xFF if the MSB of r[c] is set, and to 0x00 otherwise, which I have not been able to understand. Any number higher than 0x7F goes through the XOR, and the rest are left as is. I assume this is because the mod operation gives back the same number when the divisor is higher than the dividend, but then I would say the threshold should not be 0x7F but less than or equal to 0x8A. Thanks in advance for your insight here.

Wikipedia Link.

#include <stdio.h>
#include <stdint.h>

void mixCol(uint8_t *r)
{
    uint8_t a[4];
    uint8_t b[4];
    uint8_t h;

    for (int c = 0; c < 4; c++) {
        a[c] = r[c];
        /* h is 0xff if the high bit of r[c] is set, 0 otherwise */
        h = (uint8_t)((int8_t)r[c] >> 7); /* arithmetic right shift, thus shifting in either zeros or ones */
        b[c] = r[c] << 1; /* implicitly removes high bit because b[c] is an 8-bit char, so we xor by 0x1b and not 0x11b in the next line */
        b[c] ^= 0x1B & h; /* Rijndael's Galois field; the xor is not applied when r[c] is less than 0x80 */
    }

    r[0] = b[0] ^ a[3] ^ a[2] ^ b[1] ^ a[1]; /* 2 * a0 + a3 + a2 + 3 * a1 */
    r[1] = b[1] ^ a[0] ^ a[3] ^ b[2] ^ a[2]; /* 2 * a1 + a0 + a3 + 3 * a2 */
    r[2] = b[2] ^ a[1] ^ a[0] ^ b[3] ^ a[3]; /* 2 * a2 + a1 + a0 + 3 * a3 */
    r[3] = b[3] ^ a[2] ^ a[1] ^ b[0] ^ a[0]; /* 2 * a3 + a2 + a1 + 3 * a0 */

    printf("%X  %X\n", b[0], h); /* debug output: prints 1B  FF for the input below */
}

int main()
{
    uint8_t col1[4] = {0x80, 0xbf, 0x5d, 0x80};
    mixCol(col1);
    return 0;
}

Solution

  • Rewritten for clarity (and correcting the modulo operation). Short answer:

    Why does the modulo operation only examine one bit (bit 8, 0x80)?

    The operation in question implements a GF(2^8) multiplication by 2 with respect to the reducing polynomial 0x11B.

    Because the multiplicand has only the second bit set (bit 1; 2 = 2^1), the multiplication step reduces to a left shift by one bit. Because the other multiplicand is an 8-bit value, it is restricted to the range 0..255 (0x00 to 0xFF), so the product is restricted to even values in the range 0..510 (0x00 to 0x1FE).

    The size of the divisor 0x11B is 9 bits, and the product of the multiplication has at most 9 bits, too. This means the modulo step reduces to a single bit check, too. Logically, the check is whether the value, after shifting left one bit, has the ninth bit set. (The reason the test is & 0x100 after the bit shift, and not ≥ 0x11B after the bit shift, is the difference between the modulo operation in GF(2^N) and in normal arithmetic. If modulo operated in the same way as in normal arithmetic, then ≥ would indeed be used. But it does not.)

    Rather than operate on variables wider than 8 bits, implementations check the eighth bit (0x80) prior to the shift instead. Because the result is also limited to eight bits, the operation reduces to

    result = (byte & 0x80) ? (byte << 1) ^ 0x1B : (byte << 1);
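
    For instance, a quick check of that one-liner (a sketch; the helper name xtime is commonly used in AES code, but is not part of the original post):

    #include <stdio.h>
    #include <stdint.h>

    /* Multiply by 2 in GF(2^8) with respect to 0x11B. */
    static uint8_t xtime(uint8_t byte)
    {
        return (byte & 0x80) ? (uint8_t)((byte << 1) ^ 0x1B)
                             : (uint8_t)(byte << 1);
    }

    int main(void)
    {
        printf("%02X\n", (unsigned)xtime(0x80)); /* 0x100 ^ 0x11B = 0x01B: prints 1B */
        printf("%02X\n", (unsigned)xtime(0x7F)); /* ninth bit clear, no reduction: prints FE */
        return 0;
    }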
    

    Remember: in GF(2^N), arithmetic operations like multiplication, addition, subtraction, division, and modulus are different from what we are used to.


    In GF(2^N) arithmetic, each value v (a nonnegative integer) represents a polynomial. If we label the binary digits of v as b_k, then

        v = 2^(N-1) b_(N-1) + 2^(N-2) b_(N-2) + ... + 2^1 b_1 + 2^0 b_0

    representing polynomial

        b_(N-1) x^(N-1) + b_(N-2) x^(N-2) + ... + b_1 x^1 + b_0 x^0

    Because the coefficients b_k can only be 0 or 1, this is a characteristic-2 finite field with 2^N elements, also called a Galois field GF(2^N). For example, v = 0x1B (binary 11011) represents the polynomial x^4 + x^3 + x + 1.

    The key thing to understand is that we do not care what value x has; we do not even care what type x is, other than that x^0 = 1. We really do operate on the polynomials represented by the nonnegative integer values v, and therefore our basic arithmetic operations -- addition (and subtraction), multiplication, and division (and therefore also the modulo operation) -- work differently on v than they do in regular arithmetic.

    Because the polynomial coefficients b_k are binary, we have

        x^k + x^k = 0

    and not 2x^k nor x^(k+1). Our basic arithmetic operations do not "carry" to the next higher coefficient (binary digit). (I will refer to this as "no-carry" below.)

     
    Addition and subtraction (+ and -):

    In GF(2^N), these are done using the exclusive-or operation on v. (When there are multiple terms, each bit of the result is essentially the parity of the corresponding input bits: 0 if an even number of them are 1, and 1 if odd.)

    unsigned long gf2n_add(const unsigned long a, const unsigned long b)
    {
        return a ^ b;
    }
    
    unsigned long gf2n_sub(const unsigned long a, const unsigned long b)
    {
        return a ^ b;
    }
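
    As a quick illustration (my note, not from the original answer): in GF(2^N), 5 + 3 = 6, because 101 ^ 011 = 110 in binary; no carry propagates into the next digit, unlike normal arithmetic where 5 + 3 = 8.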
    

     
    Multiplication (⋅):

    Because of polynomial multiplication rules, we can describe GF(2^N) multiplication in terms of long multiplication, except that instead of addition, we perform an exclusive-or operation.

    /* Note: we ignore overflow, and instead assume that a and b
             are small enough for the result to not overflow. */
    unsigned long gf2n_mul(unsigned long a, const unsigned long b)
    {
        unsigned long  result = 0UL;
    
        while (b) {
    
            if (b & 1)
                result ^= a;
    
            a <<= 1;
            b >>= 1;
        }
    
        return result;
    }
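
    For example (a sketch using the function above): multiplying 3 (polynomial x + 1) by 7 (polynomial x^2 + x + 1) yields 9 (polynomial x^3 + 1), not 21:

    /* (x + 1)(x^2 + x + 1) = x^3 + 1, i.e. gf2n_mul(3, 7) == 9 */
    printf("%lu\n", gf2n_mul(3UL, 7UL)); /* prints 9 */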
    

     
    Division and modulus (/ and %):

    For readability, I shall use ∅ for the GF(2^N) modulo operation in this answer. (You probably won't see the null set character used that way anywhere else; I just didn't think of anything that would work better.)

    Division in GF(2^N) is implemented as multiplication by the inverse, and is quite complicated; fortunately, we do not need it here. However, it does affect the modulo operation, which we do need.

    The Wikipedia article on finite field arithmetic shows an example of the modulo operation in GF(2^N). It is very similar to long division.

    We start with the original dividend copied to the result. As long as the result has at least as many binary digits as the divisor, we exclusive-OR the result with the divisor shifted left so that its highest bit set is aligned to the highest bit set in the result.

    Note that because of no-carry, we do not use the comparison operator (≥) here, but only look at the highest bit set in the result.

    Because the exclusive-OR operation affects the bits that we check, we must check the bits in decreasing order. Rather than repeatedly shift the divisor left, it is easier to use a temporary value that we shift left once, and then have a loop that shifts the temporary value right by one bit, and tests one bit per loop:

    /* ceil(log2(v+1)): the number of significant bits in v */
    int bits(unsigned long v)
    {
        int result = 0;
    
        while (v) {
            v >>= 1;
            result++;
        }
    
        return result;
    }
    
    unsigned long mod(unsigned long v, unsigned long m)
    {
        /* mod(v,0) is undefined; we'll just return v unchanged. */
        if (!m)
            return v;
    
        const int      v_size = bits(v);
        const int      m_size = bits(m);
    
        /* Compare bit counts (polynomial degrees), not values:
           because of no-carry, reduction is needed whenever the
           dividend has at least as many binary digits as the
           divisor. */
        if (v_size >= m_size) {
            /* High has only the highest bit in v set. */
            unsigned long  high = 1UL << (v_size - 1);
    
            /* Mask is m shifted left so that its highest
               bit matches the value of high. */
            unsigned long  mask = m << (v_size - m_size);
    
            /* Number of bits we need to examine. */
            int            i = v_size - m_size + 1;
    
            /* As an example, if
                   v = 110101  in binary
                   m = 101     in binary
               then at this point,
                   v_size = 6
                   m_size = 3
                   high = 100000  in binary
                   mask = 101000
                   i = 4.
    
               For i steps:
                   If v has the bit corresponding to high set,
                   i.e. (v & high) is nonzero, then:
                       v = v exclusive-or mask
                   shift high right
                   shift mask right
               Return v
            */
    
            while (i--) {
                if (high & v)
                    v ^= mask;
                high >>= 1;
                mask >>= 1;
            }
        }
    
        /* If v had fewer bits than m to begin with, it is already
           fully reduced; otherwise the loop above reduced it. */
        return v;
    }
    

    An example surely clarifies the idea. Let's calculate (in binary) 1010001 ∅ 1001 (in decimal, this is 81 ∅ 9). Note that the highest bit set in the (intermediate) result determines whether an exclusive-or is done; the values they represent are not compared:

          1010001
        ^ 1001
        ──┴───────
          0011001
        ^   1001
        ────┴─────
            01011
        ^    1001
        ─────┴────
         0010  (only 2 digits, fewer than the divisor's 4), so
        ══════════
               10
    

    Therefore, 1010001 ∅ 1001 = 10 (in binary); in decimal, 81 ∅ 9 = 2. (By normal arithmetic rules, 81 % 9 = 0.)
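
    We can verify this with the mod() function above:

    printf("%lu\n", mod(81UL, 9UL)); /* prints 2 */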

    It is important to notice that this is done from the high bits down to the low bits, and that the check involves only the most significant bit left; we do not compare the values using > or ≥. There are (the number of binary digits in the dividend) - (the number of binary digits in the divisor) + 1 bits checked, and up to as many exclusive-OR operations done, overall.

     
    Multiplication (×) with respect to an irreducible polynomial represented by c:

    In Rijndael, multiplication is done with respect to the irreducible polynomial x^8 + x^4 + x^3 + x^1 + x^0, represented by the value 283 in decimal (0x11B in hexadecimal).

    In essence, a × b (with respect to c) is just

        (a ⋅ b) ∅ c

    using the definitions above for GF(2^N) arithmetic.
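
    In code, this composes the two helpers above (a sketch; the name gf2n_mulmod is mine, not from the original answer):

    unsigned long gf2n_mulmod(const unsigned long a, const unsigned long b,
                              const unsigned long c)
    {
        return mod(gf2n_mul(a, b), c);
    }

    For example, gf2n_mulmod(0x53, 0xCA, 0x11B) evaluates to 0x01; 0x53 and 0xCA are a well-known pair of multiplicative inverses in the Rijndael field.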


    For clarification:
      * and ^ denote the normal C multiplication and exclusive-or operations,
      << is bit shift left (corresponds to arithmetic multiplication by 2), and >> is bit shift right,
      ⋅ is still used for GF(2^N) multiplication,
      × for GF(2^N) multiplication with respect to some reducing polynomial,
      ∅ for the GF(2^N) modulo operation, and
      + and - refer to normal arithmetic addition and subtraction.

    Let's look at the function the OP is trying to implement, and its definition in the Rijndael MixColumns operation.

    We can see that we only ever multiply (with respect to the Rijndael polynomial represented by 0x11B) each value by 1 (representing polynomial x^0), 2 (representing polynomial x^1), or 3 (representing polynomial x^1 + x^0):

    ⎡ r0 ⎤   ⎡ 2 3 1 1 ⎤ ⎡ a0 ⎤
    ⎢ r1 ⎥ = ⎢ 1 2 3 1 ⎥ ⎢ a1 ⎥ 
    ⎢ r2 ⎥   ⎢ 1 1 2 3 ⎥ ⎢ a2 ⎥ 
    ⎣ r3 ⎦   ⎣ 3 1 1 2 ⎦ ⎣ a3 ⎦
    

    Expanding the operation above according to GF(2^N) arithmetic rules, we have

    r0 = (a0 × 2)  ^  (a1 × 3)  ^  (a2 × 1)  ^  (a3 × 1)
    r1 = (a0 × 1)  ^  (a1 × 2)  ^  (a2 × 3)  ^  (a3 × 1)
    r2 = (a0 × 1)  ^  (a1 × 1)  ^  (a2 × 2)  ^  (a3 × 3)
    r3 = (a0 × 3)  ^  (a1 × 1)  ^  (a2 × 1)  ^  (a3 × 2)
    

    Note that in GF(2^N) for N ≥ 2, 3 = 1 ^ 2, which means a3 × 3 = a3 ^ (a3 × 2) = (a3 × 2) ^ a3. If we use temporary variable b0 to represent a0 × 2, we can use

    a0 × 1 = a0
    a0 × 2 = b0
    a0 × 3 = b0 ^ a0
    

    and similarly for a1, a2, and a3. The function then simplifies to

    b0 = a0 × 2
    b1 = a1 × 2
    b2 = a2 × 2
    b3 = a3 × 2
    
    r0 = (      b0 ) ^ ( a1 ^ b1 ) ^ ( a2      ) ^ ( a3      )
    r1 = ( a0      ) ^ (      b1 ) ^ ( a2 ^ b2 ) ^ ( a3      )
    r2 = ( a0      ) ^ ( a1      ) ^ (      b2 ) ^ ( a3 ^ b3 )
    r3 = ( a0 ^ b0 ) ^ ( a1      ) ^ ( a2      ) ^ (      b3 )
    

    and since ^ is commutative (order does not matter), we can rewrite the result as

    r0 = a1 ^ a2 ^ a3 ^ b0 ^ b1
    r1 = a0 ^ a2 ^ a3 ^ b1 ^ b2
    r2 = a0 ^ a1 ^ a3 ^ b2 ^ b3
    r3 = a0 ^ a1 ^ a2 ^ b3 ^ b0
    

    Let's look at how to compute a0 × 2, a1 × 2, a2 × 2, and a3 × 2, with respect to the reducing polynomial 0x11B. Remember, by the definition of ×, these are equivalent to (a0 ⋅ 2) ∅ 0x11B, (a1 ⋅ 2) ∅ 0x11B, (a2 ⋅ 2) ∅ 0x11B, and (a3 ⋅ 2) ∅ 0x11B, respectively.

    The ⋅ 2 part is simple, as it effectively just shifts the value one binary digit left. Because a0..a3 are 8-bit values, between 0x00 and 0xFF inclusive, the maximum value after the shift is 0x1FE (111111110 in binary).

    Now, remember how the modulo operation is implemented in GF(2^N). Here, the dividend has either the same number of binary digits as the reducing polynomial, or fewer; thus, the loop reduces to a single bit test:

    b0 = a0 << 1;
    b1 = a1 << 1;
    b2 = a2 << 1;
    b3 = a3 << 1;
    if (b0 & 0x100) b0 ^= 0x11B;
    if (b1 & 0x100) b1 ^= 0x11B;
    if (b2 & 0x100) b2 ^= 0x11B;
    if (b3 & 0x100) b3 ^= 0x11B;
    

    except that since b0 .. b3 are only 8 bits in size, the above won't work. The solution is, of course, to do the bit test prior to the shift. Since the ninth bit (0x100) will be dropped anyway, exclusive-oring with 0x11B is equivalent to exclusive-oring with 0x1B:

    if (a0 & 0x80)
        b0 = (a0 << 1) ^ 0x1B;
    else
        b0 = (a0 << 1);
    

    and similarly for b1, b2, and b3.

    Unfortunately, if clauses are relatively slow on most common hardware. To speed the operation up, we can use an expression that yields 0x1B if bit 7 is set in the argument, and zero otherwise:

    #define F(v) (((v) & 0x80) ? 0x1B : 0x00)
    

    then

    b0 = (a0 << 1) ^ F(a0);
    b1 = (a1 << 1) ^ F(a1);
    b2 = (a2 << 1) ^ F(a2);
    b3 = (a3 << 1) ^ F(a3);
    

    On architectures where negative integers are represented using two's complement, we can avoid the ternary operator (an if in disguise, really): casting the value to a signed 8-bit type, shifting it right by 7 bits (the result of right-shifting a negative signed value is implementation-defined in C, but in practice compilers implement it as an arithmetic shift, replicating the sign bit), and casting back to an unsigned 8-bit integer type yields 0xFF if the high bit was set, and 0x00 otherwise. (Dividing by 128 is not equivalent: C integer division truncates toward zero, so for example (int8_t)0x81 / 128 is 0, not -1.) Fortunately, if a C compiler supports the int8_t type, the standard says it must use two's complement!

    #define NMASK(v) ((uint8_t)((int8_t)(v) >> 7))
    
    b0 = (a0 << 1) ^ (NMASK(a0) & 0x1B);
    b1 = (a1 << 1) ^ (NMASK(a1) & 0x1B);
    b2 = (a2 << 1) ^ (NMASK(a2) & 0x1B);
    b3 = (a3 << 1) ^ (NMASK(a3) & 0x1B);
    

    (Note that if implemented in hardware, (NMASK(v) & 0x1B) == (NMASK(v & 0x80) & 0x1B); i.e., the operation only examines one input bit, and produces five output bits, all of them either zero or one.)
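
    A quick sanity check of NMASK (a sketch, assuming the arithmetic shift described above):

    printf("%02X %02X\n", (unsigned)NMASK(0x80), (unsigned)NMASK(0x7F)); /* prints FF 00 */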

    The Wikipedia implementation example, and the OP, do exactly this. Essentially:

    a0 = input_byte_0
    a1 = input_byte_1
    a2 = input_byte_2
    a3 = input_byte_3
    
    b0 = (a0 << 1) ^ (0x1B & (uint8_t)( (int8_t)a0 >> 7 ))
    b1 = (a1 << 1) ^ (0x1B & (uint8_t)( (int8_t)a1 >> 7 ))
    b2 = (a2 << 1) ^ (0x1B & (uint8_t)( (int8_t)a2 >> 7 ))
    b3 = (a3 << 1) ^ (0x1B & (uint8_t)( (int8_t)a3 >> 7 ))
    
    output_byte_0 = a1 ^ a2 ^ a3 ^ b0 ^ b1
    output_byte_1 = a0 ^ a2 ^ a3 ^ b1 ^ b2
    output_byte_2 = a0 ^ a1 ^ a3 ^ b2 ^ b3
    output_byte_3 = a0 ^ a1 ^ a2 ^ b3 ^ b0
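
    Applied to the OP's test column {0x80, 0xBF, 0x5D, 0x80}, this gives b = {0x1B, 0x65, 0xBA, 0x1B} and output bytes {0x1C, 0x82, 0x1E, 0x62}; these are the values the OP's mixCol() leaves in r[].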
    

    When expanding the bytes to individual bits in an FPGA implementation, I am not sure whether one should expand the expression for each individual output bit (this should be straightforward; I'd personally write an awk script or something to do the hard work for me), or whether to use intermediate states to reduce the size of the look-up tables. I haven't had the opportunity to play with FPGAs yet, myself.