Search code examples
c++polynomial-mathcrc32

Implementing SSE 4.2's CRC32C in software


So I have a design which incorporates CRC32C checksums to ensure data hasn't been damaged. I decided to use CRC32C because I can have both a software version and a hardware-accelerated version if the computer the software runs on supports SSE 4.2

I'm going by Intel's developer manual (vol 2A), which seems to provide the algorithm behind the crc32 instruction. However, I'm having little luck. Intel's developer guide says the following:

BIT_REFLECT32: DEST[31-0] = SRC[0-31]
MOD2: Remainder from Polynomial division modulus 2

TEMP1[31-0] <- BIT_REFLECT(SRC[31-0])
TEMP2[31-0] <- BIT_REFLECT(DEST[31-0])
TEMP3[63-0] <- TEMP1[31-0] << 32
TEMP4[63-0] <- TEMP2[31-0] << 32
TEMP5[63-0] <- TEMP3[63-0] XOR TEMP4[63-0]
TEMP6[31-0] <- TEMP5[63-0] MOD2 0x11EDC6F41
DEST[31-0]  <- BIT_REFLECT(TEMP6[31-0])

Now, as far as I can tell, I've done everything up to the line starting TEMP6 correctly, but I think I may be either misunderstanding the polynomial division, or implementing it incorrectly. If my understanding is correct, 1 / 1 mod 2 = 1, 0 / 1 mod 2 = 0, and both divides-by-zero are undefined.

What I don't understand is how binary division with 64-bit and 33-bit operands will work. If SRC is 0x00000000, and DEST is 0xFFFFFFFF, TEMP5[63-32] will be all set bits, while TEMP5[31-0] will be all unset bits.

If I was to use the bits from TEMP5 as the numerator, there would be 30 divisions by zero as the polynomial 11EDC6F41 is only 33 bits long (and so converting it to a 64-bit unsigned integer leaves the top 30 bits unset), and so the denominator is unset for 30 bits.

However, if I was to use the polynomial as the numerator, the bottom 32 bits of TEMP5 are unset, resulting in divides by zero there, and the top 30 bits of the result would be zero, as the top 30 bits of the numerator would be zero, as 0 / 1 mod 2 = 0.

Am I misunderstanding how this works? Just plain missing something? Or has Intel left out some crucial step in their documentation?

The reason I went to Intel's developer guide for what appeared to be the algorithm they used is because they used a 33-bit polynomial, and I wanted to make outputs identical, which didn't happen when I used the 32-bit polynomial 1EDC6F41 (show below).

uint32_t poly = 0x1EDC6F41, sres, crcTable[256], data = 0x00000000;

for (n = 0; n < 256; n++) {
    sres = n;
    for (k = 0; k < 8; k++)
        sres = (sres & 1) == 1 ? poly ^ (sres >> 1) : (sres >> 1);
    crcTable[n] = sres;
}
sres = 0xFFFFFFFF;

for (n = 0; n < 4; n++) {
    sres = crcTable[(sres ^ data) & 0xFF] ^ (sres >> 8);
}

The above code produces 4138093821 as an output, and the crc32 opcode produces 2346497208 using the input 0x00000000.

Sorry if this is badly written or incomprehensible in places, it is rather late for me.


Solution

  • Here are both software and hardware versions of CRC-32C. The software version is optimized to process eight bytes at a time. The hardware version is optimized to run three crc32q instructions effectively in parallel on a single core, since the throughput of that instruction is one cycle, but the latency is three cycles.

    crc32c.c:

    /* crc32c.c -- compute CRC-32C using the Intel crc32 instruction
     * Copyright (C) 2013, 2021, 2023 Mark Adler
     * Version 1.3  7 Dec 2023  Mark Adler
     */
    
    /*
      This software is provided 'as-is', without any express or implied
      warranty. In no event will the author be held liable for any damages
      arising from the use of this software.
    
      Permission is granted to anyone to use this software for any purpose,
      including commercial applications, and to alter it and redistribute it
      freely, subject to the following restrictions:
    
      1. The origin of this software must not be misrepresented; you must not
         claim that you wrote the original software. If you use this software
         in a product, an acknowledgment in the product documentation would be
         appreciated but is not required.
      2. Altered source versions must be plainly marked as such, and must not be
         misrepresented as being the original software.
      3. This notice may not be removed or altered from any source distribution.
    
      Mark Adler
      [email protected]
     */
    
    /* Version History:
     1.0  10 Feb 2013  First version
     1.1  31 May 2021  Correct register constraints on assembly instructions
                       Include pre-computed tables to avoid use of pthreads
                       Return zero for the CRC when buf is NULL, as initial value
     1.2   5 Jun 2021  Make tables constant
     1.3   7 Dec 2023  Improve register constraints for optimization
     */
    
    // Use hardware CRC instruction on Intel SSE 4.2 processors. This computes a
    // CRC-32C, *not* the CRC-32 used by Ethernet and zip, gzip, etc. A software
    // version is provided as a fall-back, as well as for speed comparisons.
    
    #include <stddef.h>
    #include <stdint.h>
    
    // Tables for CRC word-wise calculation, definitions of LONG and SHORT, and CRC
    // shifts by LONG and SHORT bytes.
    #include "crc32c.h"
    
    // Table-driven software version as a fall-back. This is about 15 times slower
    // than using the hardware instructions. This assumes little-endian integers,
    // as is the case on Intel processors that the assembler code here is for.
    static uint32_t crc32c_sw(uint32_t crc, void const *buf, size_t len) {
        if (buf == NULL)
            return 0;
        unsigned char const *data = buf;
        while (len && ((uintptr_t)data & 7) != 0) {
            crc = (crc >> 8) ^ crc32c_table[0][(crc ^ *data++) & 0xff];
            len--;
        }
        size_t n = len >> 3;
        for (size_t i = 0; i < n; i++) {
            uint64_t word = crc ^ ((uint64_t const *)data)[i];
            crc = crc32c_table[7][word & 0xff] ^
                  crc32c_table[6][(word >> 8) & 0xff] ^
                  crc32c_table[5][(word >> 16) & 0xff] ^
                  crc32c_table[4][(word >> 24) & 0xff] ^
                  crc32c_table[3][(word >> 32) & 0xff] ^
                  crc32c_table[2][(word >> 40) & 0xff] ^
                  crc32c_table[1][(word >> 48) & 0xff] ^
                  crc32c_table[0][word >> 56];
        }
        data += n << 3;
        len &= 7;
        while (len) {
            len--;
            crc = (crc >> 8) ^ crc32c_table[0][(crc ^ *data++) & 0xff];
        }
        return crc;
    }
    
    // Apply the zeros operator table to crc.
    static uint32_t crc32c_shift(uint32_t const zeros[][256], uint32_t crc) {
        return zeros[0][crc & 0xff] ^ zeros[1][(crc >> 8) & 0xff] ^
               zeros[2][(crc >> 16) & 0xff] ^ zeros[3][crc >> 24];
    }
    
    // Compute CRC-32C using the Intel hardware instruction. Three crc32q
    // instructions are run in parallel on a single core. This gives a
    // factor-of-three speedup over a single crc32q instruction, since the
    // throughput of that instruction is one cycle, but the latency is three
    // cycles.
    static uint32_t crc32c_hw(uint32_t crc, void const *buf, size_t len) {
        if (buf == NULL)
            return 0;
    
        // Pre-process the crc.
        uint64_t crc0 = crc ^ 0xffffffff;
    
        // Compute the crc for up to seven leading bytes, bringing the data pointer
        // to an eight-byte boundary.
        unsigned char const *next = buf;
        while (len && ((uintptr_t)next & 7) != 0) {
            __asm__("crc32b\t%1, %0"
                    : "+r"(crc0)
                    : "m"(*next));
            next++;
            len--;
        }
    
        // Compute the crc on sets of LONG*3 bytes, making use of the three-cycle
        // latency and one-cycle throughput of a single core.
        while (len >= LONG*3) {
            uint64_t crc1 = 0;
            uint64_t crc2 = 0;
            unsigned char const *end = next + LONG;
            do {
                __asm__("crc32q\t%3, %0\n\t"
                        "crc32q\t%4, %1\n\t"
                        "crc32q\t%5, %2"
                        : "+r"(crc0), "+r"(crc1), "+r"(crc2)
                        : "m"(*(uint64_t const *)next),
                          "m"(*(uint64_t const *)(next + LONG)),
                          "m"(*(uint64_t const *)(next + 2*LONG)));
                next += 8;
            } while (next < end);
            crc0 = crc32c_shift(crc32c_long, crc0) ^ crc1;
            crc0 = crc32c_shift(crc32c_long, crc0) ^ crc2;
            next += LONG*2;
            len -= LONG*3;
        }
    
        // Do the same thing, but now on SHORT*3 blocks for the remaining data less
        // than a LONG*3 block.
        while (len >= SHORT*3) {
            uint64_t crc1 = 0;
            uint64_t crc2 = 0;
            unsigned char const *end = next + SHORT;
            do {
                __asm__("crc32q\t%3, %0\n\t"
                        "crc32q\t%4, %1\n\t"
                        "crc32q\t%5, %2"
                        : "+r"(crc0), "+r"(crc1), "+r"(crc2)
                        : "m"(*(uint64_t const *)next),
                          "m"(*(uint64_t const *)(next + SHORT)),
                          "m"(*(uint64_t const *)(next + 2*SHORT)));
                next += 8;
            } while (next < end);
            crc0 = crc32c_shift(crc32c_short, crc0) ^ crc1;
            crc0 = crc32c_shift(crc32c_short, crc0) ^ crc2;
            next += SHORT*2;
            len -= SHORT*3;
        }
    
        // Compute the crc on the remaining eight-byte units less than a SHORT*3
        // block.
        unsigned char const *end = next + (len - (len & 7));
        while (next < end) {
            __asm__("crc32q\t%1, %0"
                    : "+r"(crc0)
                    : "m"(*(uint64_t const *)next));
            next += 8;
        }
        len &= 7;
    
        // Compute the crc for up to seven trailing bytes.
        while (len) {
            __asm__("crc32b\t%1, %0"
                    : "+r"(crc0)
                    : "m"(*next));
            next++;
            len--;
        }
    
        // Return the crc, post-processed.
        return ~(uint32_t)crc0;
    }
    
    // Check for SSE 4.2. SSE 4.2 was first supported in Nehalem processors
    // introduced in November, 2008. This does not check for the existence of the
    // cpuid instruction itself, which was introduced on the 486SL in 1992, so this
    // will fail on earlier x86 processors. cpuid works on all Pentium and later
    // processors.
    #define SSE42(have) \
        do { \
            uint32_t eax, ecx; \
            eax = 1; \
            __asm__("cpuid" \
                    : "=c"(ecx) \
                    : "a"(eax) \
                    : "%ebx", "%edx"); \
            (have) = (ecx >> 20) & 1; \
        } while (0)
    
    // Compute a CRC-32C. If the crc32 instruction is available, use the hardware
    // version. Otherwise, use the software version.
    uint32_t crc32c(uint32_t crc, void const *buf, size_t len) {
        int sse42;
        SSE42(sse42);
        return sse42 ? crc32c_hw(crc, buf, len) : crc32c_sw(crc, buf, len);
    }
    
    #ifdef TEST
    
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    
    #define SIZE (262144*3)
    #define CHUNK SIZE
    
    int main(int argc, char **argv) {
        (void)argv;
        uint32_t crc = 0;
        char *buf = malloc(SIZE);
        if (buf == NULL) {
            fputs("out of memory", stderr);
            return 1;
        }
        ssize_t got;
        while ((got = read(0, buf, SIZE)) > 0) {
            size_t off = 0;
            do {
                size_t n = (size_t)got - off;
                if (n > CHUNK)
                    n = CHUNK;
                crc = argc > 1 ? crc32c_sw(crc, buf + off, n) :
                                 crc32c(crc, buf + off, n);
                off += n;
            } while (off < (size_t)got);
        }
        free(buf);
        if (got == -1) {
            fputs("read error\n", stderr);
            return 1;
        }
        printf("%08x\n", crc);
        return 0;
    }
    
    #endif /* TEST */
    

    Code to generate crc32c.h (stackoverflow won't let me post the tables themselves, due to a 30,000 character limit in an answer):

    // Generate crc32c.h for crc32c.c.
    
    #include <stdio.h>
    #include <stdint.h>
    
    #define LONG 8192
    #define SHORT 256
    
    // Print a 2-D table of four-byte constants in hex.
    static void print_table(uint32_t *tab, size_t rows, size_t cols, char *name) {
        printf("static uint32_t const %s[][%zu] = {\n", name, cols);
        size_t end = rows * cols;
        size_t k = 0;
        for (;;) {
            fputs("   {", stdout);
            size_t n = 0, j = 0;
            for (;;) {
                printf("0x%08x", tab[k + n]);
                if (++n == cols)
                    break;
                putchar(',');
                if (++j == 6) {
                    fputs("\n   ", stdout);
                    j = 0;
                }
                putchar(' ');
            }
            k += cols;
            if (k == end)
                break;
            puts("},");
        }
        puts("}\n};");
    }
    
    /* CRC-32C (iSCSI) polynomial in reversed bit order. */
    #define POLY 0x82f63b78
    
    static void crc32c_word_table(void) {
        uint32_t table[8][256];
    
        // Generate byte-wise table.
        for (unsigned n = 0; n < 256; n++) {
            uint32_t crc = ~n;
            for (unsigned k = 0; k < 8; k++)
                crc = crc & 1 ? (crc >> 1) ^ POLY : crc >> 1;
            table[0][n] = ~crc;
        }
    
        // Use byte-wise table to generate word-wise table.
        for (unsigned n = 0; n < 256; n++) {
            uint32_t crc = ~table[0][n];
            for (unsigned k = 1; k < 8; k++) {
                crc = table[0][crc & 0xff] ^ (crc >> 8);
                table[k][n] = ~crc;
            }
        }
    
        // Print table.
        print_table(table[0], 8, 256, "crc32c_table");
    }
    
    // Return a(x) multiplied by b(x) modulo p(x), where p(x) is the CRC
    // polynomial. For speed, this requires that a not be zero.
    static uint32_t multmodp(uint32_t a, uint32_t b) {
        uint32_t prod = 0;
        for (;;) {
            if (a & 0x80000000) {
                prod ^= b;
                if ((a & 0x7fffffff) == 0)
                    break;
            }
            a <<= 1;
            b = b & 1 ? (b >> 1) ^ POLY : b >> 1;
        }
        return prod;
    }
    
    /* Take a length and build four lookup tables for applying the zeros operator
       for that length, byte-by-byte, on the operand. */
    static void crc32c_zero_table(size_t len, char *name) {
        // Generate operator for len zeros.
        uint32_t op = 0x80000000;               // 1 (x^0)
        uint32_t sq = op >> 4;                  // x^4
        while (len) {
            sq = multmodp(sq, sq);              // x^2^(k+3), k == len bit position
            if (len & 1)
                op = multmodp(sq, op);
            len >>= 1;
        }
    
        // Generate table to update each byte of a CRC using op.
        uint32_t table[4][256];
        for (unsigned n = 0; n < 256; n++) {
            table[0][n] = multmodp(op, n);
            table[1][n] = multmodp(op, n << 8);
            table[2][n] = multmodp(op, n << 16);
            table[3][n] = multmodp(op, n << 24);
        }
    
        // Print the table to stdout.
        print_table(table[0], 4, 256, name);
    }
    
    int main(void) {
        puts(
    "// crc32c.h\n"
    "// Tables and constants for crc32c.c software and hardware calculations.\n"
    "\n"
    "// Table for a 64-bits-at-a-time software CRC-32C calculation. This table\n"
    "// has built into it the pre and post bit inversion of the CRC."
        );
        crc32c_word_table();
        puts(
    "\n// Block sizes for three-way parallel crc computation.  LONG and SHORT\n"
    "// must both be powers of two."
            );
        printf("#define LONG %d\n", LONG);
        printf("#define SHORT %d\n", SHORT);
        puts(
    "\n// Table to shift a CRC-32C by LONG bytes."
        );
        crc32c_zero_table(8192, "crc32c_long");
        puts(
    "\n// Table to shift a CRC-32C by SHORT bytes."
        );
        crc32c_zero_table(256, "crc32c_short");
        return 0;
    }