I came across an allegedly very efficient and elegant CRC implementation and I am trying to really understand all the steps. I understand the CRC-CCITT 0x1021 implementations that iterate over each bit, but I am struggling to get this one. This is the code.
/*
* Original Code by Ashley Roll
*/
uint16_t crc16(uint8_t* data, uint16_t length){
uint8_t x;
uint16_t crc = 0xFFFF;
while (length--){
x = crc >> 8 ^ *data++;
x ^= x>>4;
crc = (crc << 8) ^ ((uint16_t)(x << 12)) ^ ((uint16_t)(x <<5)) ^ ((uint16_t)x);
}
return crc;
}
Could anyone explain to me a little bit deeper what is going on with the x variable? This is what I was able to grasp until now
x = crc >> 8 ^ *data++; // Here, we take the high byte of the register
// and we XOR it with the next 8 bits of data. Why?
x ^= x>>4; // Then, we XOR it with its last four bits?
// And we always remain with 4 possible non-zero bits, right?
// Why do we do this?
crc = (crc << 8) ^ ((uint16_t)(x << 12)) ^ ((uint16_t)(x <<5)) ^ ((uint16_t)x);
// Finally, we drop the oldest (highest) byte from the register
// And XOR it with rotations of x, according to the polynomial
// 0x1021, which is x^16 + x^12 + x^5 + 1
// Is (16-12) = 4 the reason why x has 4 possible non-zero bits?
I guess that this algorithm is just the equivalent of 8 loops of a bitwise algorithm but I would appreciate some clarification. Thanks for your time.
It would have helped if the code included a comment that x is the quotient to be used to process 8 bits at a time using a binary borrowless divide.
The crc divides the data + 16 pad bits of zero by poynomial, and the remainder is the CRC.
poly = x^16 + x^12 + x^5 + 1 = 0x11021 = 10001000000100001 (binary)
To process 8 bits at a time, each step divides the bits aaaabbbb0000000000000000 by 10001000000100001, where aaaabbbb is the upper 8 bits of the crc xor'ed with the next data byte. This can be done in a single divide step by noting the quotient = x = (aaaabbbb)^(aaaabbbb>>4) = aaaacccc, where c = a^b, so a^c = b, and a^b^c = 0:
aaaacccc
--------------------------
10001000000100001 | aaaabbbb0000000000000000
aaaacccc
aaaacccc
aaaacccc
aaaacccc
----------------
cccbaaacbbbacccc
In the questions code, the bits above x^16 are not generated, since it is known they will cancel out bits x^16 through x^23. The left shift of 12 shifts off the upper 4 bits, corresponding to bits x^16 to x^19, and there is no left shift of 16.
example: data = 0x31 0x32 = 00110001 00110010
crc = 1111111111111111
x = (crc>>8)^00110001 = 11001110
q = (x)^(x>>4) = 11000010
11000010
-------------------------
10001000000100001 |110011100000000000000000
11000010
11000010
11000010
11000010
----------------
0011100010000010
crc = (crc<<8)^0011100010000010 = 1100011110000010
x = (crc>>8) ^ 00110010 = 11110101
q = (x)^(x>>4) = 11111010
11111010
-------------------------
10001000000100001 |111101010000000000000000
11111010
11111010
11111010
11111010
----------------
1011111110111010
crc = (crc<<8)^1011111110111010 = 0011110110111010 = 0x3dba
As commented, a table look up should be faster. If a cpu has a fast carryless multiply, such as X86 pclmulqdq, then a faster still assembly program can be used, but it's over 500 lines long (to "fold" 128 bytes at a time in parallel using xmm registers). The assembly code below doesn't document the constants (rk...); they are powers of 2 modulo the polynomial used for "folding", except rk7 = "(2^n)/polynomial", and rk8 = polynomial.
https://github.com/intel/isa-l/blob/master/crc/crc16_t10dif_01.asm
I converted the code to Visual Studio ML64.EXE (MASM) and Windows, and have source code for crc16, crc32, crc64, non-reflected (non-reversed) and reflected).