Search code examples
ccrccrc16coding-efficiency

Understanting an efficient CRC-CCITT-16 implementation


I came across an allegedly very efficient and elegant CRC implementation and I am trying to really understand all the steps. I understand the CRC-CCITT 0x1021 implementations that iterate over each bit, but I am struggling to get this one. This is the code.

/*
* Original Code by Ashley Roll 
*/

uint16_t crc16(uint8_t* data, uint16_t length){

  uint8_t x;
  uint16_t crc = 0xFFFF;

  while (length--){
      x = crc >> 8 ^ *data++;
      x ^= x>>4;
      crc = (crc << 8) ^ ((uint16_t)(x << 12)) ^ ((uint16_t)(x <<5)) ^ ((uint16_t)x);
  }

  return crc;
}

Could anyone explain to me a little bit deeper what is going on with the x variable? This is what I was able to grasp until now

x = crc >> 8 ^ *data++; // Here, we take the high byte of the register
                        // and we XOR it with the next 8 bits of data. Why?
x ^= x>>4;              // Then, we XOR it with its last four bits?
                        // And we always remain with 4 possible non-zero bits, right?
                        // Why do we do this?
crc = (crc << 8) ^ ((uint16_t)(x << 12)) ^ ((uint16_t)(x <<5)) ^ ((uint16_t)x);
                        // Finally, we drop the oldest (highest) byte from the register
                        // And XOR it with rotations of x, according to the polynomial
                        // 0x1021, which is x^16 + x^12 + x^5 + 1
                        // Is (16-12) = 4 the reason why x has 4 possible non-zero bits?

I guess that this algorithm is just the equivalent of 8 loops of a bitwise algorithm but I would appreciate some clarification. Thanks for your time.


Solution

  • It would have helped if the code included a comment that x is the quotient to be used to process 8 bits at a time using a binary borrowless divide.

    The crc divides the data + 16 pad bits of zero by poynomial, and the remainder is the CRC.

    poly = x^16 + x^12 + x^5 + 1 = 0x11021 = 10001000000100001 (binary)
    

    To process 8 bits at a time, each step divides the bits aaaabbbb0000000000000000 by 10001000000100001, where aaaabbbb is the upper 8 bits of the crc xor'ed with the next data byte. This can be done in a single divide step by noting the quotient = x = (aaaabbbb)^(aaaabbbb>>4) = aaaacccc, where c = a^b, so a^c = b, and a^b^c = 0:

                                        aaaacccc
                      --------------------------
    10001000000100001 | aaaabbbb0000000000000000
                                        aaaacccc
                                   aaaacccc
                            aaaacccc
                        aaaacccc
                                ----------------
                                cccbaaacbbbacccc
    

    In the questions code, the bits above x^16 are not generated, since it is known they will cancel out bits x^16 through x^23. The left shift of 12 shifts off the upper 4 bits, corresponding to bits x^16 to x^19, and there is no left shift of 16.


    example: data = 0x31 0x32 = 00110001 00110010
    
    crc = 1111111111111111
    x = (crc>>8)^00110001 = 11001110
    q = (x)^(x>>4) = 11000010
    
                                       11000010
                      -------------------------
    10001000000100001 |110011100000000000000000
                                       11000010
                                  11000010
                           11000010
                       11000010
                               ----------------
                               0011100010000010
    
    crc = (crc<<8)^0011100010000010 = 1100011110000010
    x = (crc>>8) ^ 00110010 = 11110101
    q = (x)^(x>>4) = 11111010
    
                                       11111010
                      -------------------------
    10001000000100001 |111101010000000000000000
                                       11111010
                                  11111010
                           11111010
                       11111010
                               ----------------
                               1011111110111010
    
    crc = (crc<<8)^1011111110111010 = 0011110110111010 = 0x3dba
    

    As commented, a table look up should be faster. If a cpu has a fast carryless multiply, such as X86 pclmulqdq, then a faster still assembly program can be used, but it's over 500 lines long (to "fold" 128 bytes at a time in parallel using xmm registers). The assembly code below doesn't document the constants (rk...); they are powers of 2 modulo the polynomial used for "folding", except rk7 = "(2^n)/polynomial", and rk8 = polynomial.

    https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf

    https://github.com/intel/isa-l/blob/master/crc/crc16_t10dif_01.asm

    I converted the code to Visual Studio ML64.EXE (MASM) and Windows, and have source code for crc16, crc32, crc64, non-reflected (non-reversed) and reflected).