Search code examples
crccrc32crc16

Impact of data padding on CRC calculation


I am calculating CRC on a large chunk of data every cycle in hardware (64B per cycle). In order to parallelize the CRC calculation, I want to calculate the CRC for small data chunks and then XOR them in parallel.

Approach:

  1. We divide the data into small chunks (64B data divided into 8 chunks of 8B each).
  2. Then we calculate CRC's for all the chunks individually (8 CRC's in parallel for 8B chunks).
  3. Finally calculate the CRC for padded data. This answer points out that the CRC for padded data is obtained by multiplying the old CRC with x^n.

Hence, I am calculating the CRC for a small chunk of data, then multiply it with CRC of 0x1 shifted by 'i' times as shown below.

In short, I am trying to accomplish below: CRC approach

For example: CRC-8 on this site:

Input Data=(0x05 0x07) CRC=0x54
Step-1: Data=0x5 CRC=0x1B
Step-2: Data=0x7 CRC=0x15
Step-3: Data=(0x1 0x0) CRC=0x15
Step-4: Multiply step-1 CRC and step-3 CRC with primitive polynomial 0x7. So, I calculate (0x1B).(0x15) = (0x1 0xC7) mod 0x7.
Step-5: Calculate CRC Data=(0x1 0xC7) CRC=0x4E (I assume this is same as (0x1 0xC7) mod 0x7)
Step-6: XOR the result to get the final CRC. 0x4E^0x15=0x5B

As we can see, the result in step-6 is not the correct result.

Can someone help me how to calculate the CRC for padded data? Or where am I going wrong in the above example?


Solution

  • Rather than calculate and then adjust multiple CRC's, bytes of data can be carryless multiplied to form a set of 16 bit "folded" products, which are then xor'ed and a single modulo operation performed on the xor'ed "folded" products. An optimized modulo operation uses two carryless multiples, so it's avoided until all folded products have been generated and xor'ed together. A carryless multiply uses XOR instead of ADD and a borrowless divide uses XOR instead of SUB. Intel has a pdf file about this using the XMM instruction PCLMULQDQ (carryless multiply), where 16 bytes are read at a time, split into two 8 byte groups, with each group folded into a 16 byte product, and the two 16 byte products are xor'ed to form a single 16 byte product. Using 8 XMM registers to hold folding products, 128 bytes at time are processed. (256 bytes at at time in the case of AVX512 and ZMM registers).

    https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf

    Assume your hardware can implement a carryless multiply that takes two 8 bit operands and produces a 16 bit (technically 15 bit) product.

    Let message = M = 31 32 33 34 35 36 37 38. In this case CRC(M) = C7

    pre-calculated constants (all values shown in hex):
    
    2^38%107 = DF    cycles forwards 0x38 bits
    2^30%107 = 29    cycles forwards 0x30 bits
    2^28%107 = 62    cycles forwards 0x28 bits
    2^20%107 = 16    cycles forwards 0x20 bits
    2^18%107 = 6B    cycles forwards 0x18 bits
    2^10%107 = 15    cycles forwards 0x10 bits
    2^08%107 = 07    cycles forwards 0x08 bits
    2^00%107 = 01    cycles forwards 0x00 bits
    
    16 bit folded (cycled forward) products (can be calculated in parallel):    
    31·DF = 16CF
    32·29 = 07E2
    33·62 = 0AC6
    34·16 = 03F8
    35·6B = 0A17
    36·15 = 038E
    37·07 = 0085
    38·01 = 0038
            ----
    V =     1137    the xor of the 8 folded products
    CRC(V) = 113700 % 107 = C7
    

    To avoid having to use borrowless divide for the modulo operation, CRC(V) can be computed using carryless multiply. For example

    V = FFFE
    CRC(V) = FFFE00 % 107 = 23.
    

    Implementation, again all values in hex (hex 10 = decimal 16), ⊕ is XOR.

    input:
    V = FFFE
    constants:
    P = 107                  polynomial
    I = 2^10 / 107 = 107     "inverse" of polynomial
                             by coincidence, it's the same value
    2^10 % 107 = 15          for folding right 16 bits
    fold the upper 8 bits of FFFE00 16 bits to the right:
    U = FF·15 ⊕ FE00 = 0CF3 ⊕ FE00 = F2F3  (check: F2F3%107 = 23 = CRC)
    Q = ((U>>8)·I)>>8 = (F2·107)>>8 = ...
    to avoid a 9 bit operand, split up 107 = 100 ⊕ 7 
    Q = ((F2·100) ⊕ (F2·07))>>8 = ((F2<<8) ⊕ (F2·07))>>8 = (F200 ⊕ 02DE)>>8 = F0DE>>8 = F0
    X = Q·P = F0·107 = F0·100 ⊕ F0·07 = F0<<8 ⊕ F0·07 = F000 ⊕ 02D0 = F2D0
    CRC = U ⊕ X = F2F3 ⊕ F2D0 = 23
    

    Since the CRC is 8 bits, there's no need for the upper 8 bits in the last two steps, but it doesn't help that much for the overall calculation.

    X = (Q·(P&FF))&FF = (F0·07)&FF = D0
    CRC = (U&FF) ⊕ X = F3 ⊕ D0 = 23
    

    Example program to generate 2^0x10 / 0x107 and powers of 2 % 0x107:

    #include <stdio.h>
    
    typedef unsigned char  uint8_t;
    typedef unsigned short uint16_t;
    
    #define poly 0x107
    
    uint16_t geninv(void)           /* generate 2^16 / 9 bit poly */
    {
    uint16_t q = 0x0000u;           /* quotient */
    uint16_t d = 0x0001u;           /* initial dividend = 2^0 */
        for(int i = 0; i < 16; i++){
            d <<= 1;
            q <<= 1;
            if(d&0x0100){           /* if bit 8 set */
                q |= 1;             /*   q |= 1 */
                d ^= poly;          /*   d ^= poly */
            }
        }
        return q;                   /* return inverse */
    }
    
    uint8_t powmodpoly(int n)       /* generate 2^n % 9 bit poly */
    {
    uint16_t d = 0x0001u;           /* initial dividend = 2^0 */
        for(int i = 0; i < n; i++){
            d <<= 1;                /* shift dvnd left */
            if(d&0x0100){           /* if bit 8 set */
                d ^= poly;          /*   d ^= poly */
            }
        }
        return (uint8_t)d;          /* return remainder */
    }
    
    int main()
    {
        printf("%04x\n", geninv());
        printf("%02x %02x %02x %02x %02x %02x %02x %02x %02x %02x\n",
               powmodpoly(0x00), powmodpoly(0x08), powmodpoly(0x10), powmodpoly(0x18),
               powmodpoly(0x20), powmodpoly(0x28), powmodpoly(0x30), powmodpoly(0x38),
               powmodpoly(0x40), powmodpoly(0x48));
        printf("%02x\n", powmodpoly(0x77));  /* 0xd9, cycles crc backwards 8 bits */
        return 0;
    }
    

    Long hand example for 2^0x10 / 0x107.

                                100000111    quotient
                      -------------------
    divisor 100000111 | 10000000000000000    dividend
                        100000111
                        ---------
                              111000000
                              100000111
                              ---------
                               110001110
                               100000111
                               ---------
                                100010010
                                100000111
                                ---------
                                    10101    remainder
    

    I don't know how many registers you can have in your hardware design, but assume there are five 16 bit registers used to hold folded values, and either two or eight 8 bit registers (depending on how parallel the folding is done). Then following the Intel paper, you fold values for all 64 bytes, 8 bytes at a time, and only need one modulo operation. Register size, fold# = 16 bits, reg# = 8 bits. Note that powers of 2 modulo poly are pre-calculated constants.

    foldv = prior buffer's folding value, equivalent to folded msg[-2 -1]
    reg0 = foldv>>8
    reg1 = foldv&0xFF
    foldv  = reg0·((2^0x18)%poly)    advance by 3 bytes
    foldv ^= reg1·((2^0x10)%poly)    advance by 2 bytes
    fold0 = msg[0 1] ^ foldv         handling 2 bytes at a time
    fold1 = msg[2 3]
    fold2 = msg[4 5]
    fold3 = msg[6 7]
    for(i = 8; i < 56; i += 8){
        reg0  = fold0>>8
        reg1  = fold0&ff
        fold0  = reg0·((2^0x48)%poly)    advance by 9 bytes
        fold0 ^= reg1·((2^0x40)%poly)    advance by 8 bytes
        fold0 ^= msg[i+0 i+1]
        reg2  = fold1>>8                 if not parallel, reg0
        reg3  = fold1&ff                  and reg1
        fold1  = reg2·((2^0x48)%poly)    advance by 9 bytes
        fold1 ^= reg3·((2^0x40)%poly)    advance by 8 bytes
        fold1 ^= msg[i+2 i+3]
        ...
        fold3 ^= msg[i+6 i+7]
    }
    reg0  = fold0>>8
    reg1  = fold0&ff
    fold0  = reg0·((2^0x38)%poly)    advance by 7 bytes
    fold0 ^= reg1·((2^0x30)%poly)    advance by 6 bytes
    reg2  = fold1>>8                 if not parallel, reg0
    reg3  = fold1&ff                  and reg1
    fold1  = reg2·((2^0x28)%poly)    advance by 5 bytes
    fold1 ^= reg3·((2^0x20)%poly)    advance by 4 bytes
    fold2 ...                        advance by 3 2 bytes
    fold3 ...                        advance by 1 0 bytes
    foldv = fold0^fold1^fold2^fold3
    

    Say the final buffer has 5 bytes:

    foldv = prior folding value, equivalent to folded msg[-2 -1]
    reg0 = foldv>>8
    reg1 = foldv&0xFF
    foldv  = reg0·((2^0x30)%poly)    advance by 6 bytes
    foldv ^= reg1·((2^0x28)%poly)    advance by 5 bytes
    fold0 = msg[0 1] ^ foldv
    reg0  = fold0>>8
    reg1  = fold0&ff
    fold0  = reg0·((2^0x20)%poly)    advance by 4 bytes
    fold0 ^= reg1·((2^0x18)%poly)    advance by 3 bytes
    fold1 = msg[2 3]
    reg2  = fold1>>8
    reg3  = fold1&ff
    fold1  = reg0·((2^0x10)%poly)    advance by 2 bytes
    fold1 ^= reg1·((2^0x08)%poly)    advance by 1 bytes
    fold2 = msg[4]                   just one byte loaded
    fold3 = 0
    foldv = fold0^fold1^fold2^fold3
    now use the method above to calculate CRC(foldv)