How can 'clmul' intrinsics be used to improve computation of CRC in RISCV?

My goal is to understand and improve the computation of CRC using clmul instructions in RISCV. The way described in RISCV bitmanip documentation is following:

4.8 Cyclic redundency checks (CRC)

There are special instructions for performing CRCs using the two most widespread 32-bit CRC polynomials, CRC-32 and CRC-32C. CRCs with other polynomials can be computed efficiently using CLMUL. The following examples RISC-V Bitmanip Extension V0.93 83 are using CRC32Q. The easiest way of implementing CRC32Q with clmul is using a Barrett reduction. On RV32:

uint32_t crc32q_simple(const uint32_t *data, int length) {
    uint32_t P = 0x814141AB;    // CRC polynomial (implicit x^32)
    uint32_t mu = 0xFEFF7F62;   // x^64 divided by CRC polynomial
    uint32_t mu1 = 0xFF7FBFB1;  // "mu" with leading 1, shifted right by 1 bit
    uint32_t crc = 0;
    for (int i = 0; i < length; i++) {
        crc ^= rev8(data[i]);
        crc = clmulr(crc, mu1);
        crc = clmul(crc, P);
    }
    return crc;
}

I already know how constants mu and mu1 are computed, but I still have questions:

Why can original internal loop be replaced with clmul intrinsics? This is not stated in documentation. For reference, this is a generic CRC code sample for CRC-32/AIXM:

uint32_t crc32_bitwise_aixm(const uint8_t* data, size_t length) {   
  uint32_t crc = 0;
  for (int i = 0; i < length; i++) {
    uint8_t byte = data[i];
    for (int j = 0; j < 8; j++) {
      if ((crc ^ (byte << 24)) & 0x80000000)
        crc = (crc << 1) ^ 0x814141ab; // Here is the polynomial from which constants mu and mu1 are computed
      else
        crc = crc << 1;
      byte = byte << 1;
    }
  }
  return crc;
}

I know how this constants are computed, but the question is, WHY are they computed and why passing them to clmul makes working CRC?
What is 0x80000000 constant in original CRC loop? I think it has something to do with masking portion of number byte << 24.

Solution

There is too much to explain in an answer here. Since you don't even understand the 0x80000000, you need to first understand at least the basic mathematics of cyclic redundancy check and importantly their practical implementations. I recommend Ross William's guide. The simple answer for just your specific question is that it is implementing a linear feedback shift register that shifts up, and is reading the top bit to feed back into the other bits using an exclusive-or mask. However you also need to understand how it is that such a shift register implements a polynomial division over GF(2), where a CRC is the remainder of such a division.

Once you understand all that, then you're ready for how to do a CRC using multiplication instead of division. You already named the approach, called Barrett's Reduction, originally for integer division but applied here to polynomial division. Read that post. In gross terms, you are replacing division by a polynomial with multiplication by the reciprocal of that polynomial. Your mu is the reciprocal. Carry-less multiplication is polynomial multiplication, where each bit in a register is the coefficient of a polynomial over GF(2).