Search code examples
casciisimdsseintrinsics

bitpack ascii string into 7-bit binary blob using SIMD


Related: bitpack ascii string into 7-bit binary blob using ARM-v8 Neon SIMD - same question specialized for AArch64 intrinsics. This question covers portable C and x86-64 intrinsics.


I would like to encode a char string as a 7-bit blob to gain a 12.5% reduction in memory. I want to do it as fast a possible, i.e. with minimal latency when encoding large strings.

Here is the plain implementation of the algo:

void ascii_pack(const char* ascii, size_t len, uint8_t* bin) {
  uint64_t val;
  const char* end = ascii + len;

  while (ascii + 8 <= end) {
    memcpy(&val, ascii, 8);
    uint64_t dest = (val & 0xFF);

    // Compiler will perform loop unrolling
    for (unsigned i = 1; i <= 7; ++i) {
      val >>= 1;
      dest |= (val & (0x7FUL << 7 * i));
    }
    memcpy(bin, &dest, 7);
    bin += 7;
    ascii += 8;
  }

  // epilog - we do not pack since we have less than 8 bytes.
  while (ascii < end) {
    *bin++ = *ascii++;
  }
}

now, I would like to speed it up with SIMD. I came with SSE2 algo below. My question:

  1. is it possible to optimize the internal loop that is sequential?
  2. will it improve the throughput when running on large strings?

// The algo - do in parallel what ascii_pack does on two uint64_t integers
void ascii_pack_simd(const char* ascii, size_t len, uint8_t* bin) {
  __m128i val;

  __m128i mask = _mm_set1_epi64x(0x7FU);  // two uint64_t masks

  // I leave out 16 bytes in addition to 16 that we load in the loop
  // because we store into "bin" full 16 bytes instead of 14. To prevent out of bound
  // writes we finish one iteration earlier.
  const char* end = ascii + len - 32;
  while (ascii <= end) {
    val = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ascii));
    __m128i dest = _mm_and_si128(val, mask);

    // Compiler unrolls it
    for (unsigned i = 1; i <= 7; ++i) {
      val = _mm_srli_epi64(val, 1);                          // shift right both integers
      __m128i shmask = _mm_slli_epi64(mask, 7 * i);    // mask both
      dest = _mm_or_si128(dest, _mm_and_si128(val, shmask));  // add another 7bit part.
    }

    // dest contains two 7 byte blobs. Lets copy them to bin.
    _mm_storeu_si128(reinterpret_cast<__m128i*>(bin), dest);
    memmove(bin + 7, bin + 8, 7);
    bin += 14;
    ascii += 16;
  }

  end += 32;  // Bring back end.
  DCHECK(ascii < end);
  ascii_pack(ascii, end - ascii, bin);
}


Solution

  • The scalar trick (without requiring PEXT) which I referred to in the comments could be implemented like this:

    uint64_t compress8x7bit(uint64_t x)
    {
        x = ((x & 0x7F007F007F007F00) >> 1) | (x & 0x007F007F007F007F);
        x = ((x & 0x3FFF00003FFF0000) >> 2) | (x & 0x00003FFF00003FFF);
        x = ((x & 0x0FFFFFFF00000000) >> 4) | (x & 0x000000000FFFFFFF);
        return x;
    }
    

    The idea here is to concatenate together adjacent pairs, first concatenate 7-bit elements into 14-bit elements, then concatenate them into 28-bit elements, and finally concatenate them into one 56-bit chunk (which is the result).

    With SSSE3, you could use pshufb to concatenate two of those 56-bit parts (before storing them) too.

    SSE2 (and AVX2) can do the same thing as that scalar code with 64-bit elements, but this approach does not take advantage of any techniques that may be possible with special operations (which SSE2+ has plenty of, more with every version), there are probably better things to do than just implementing the scalar trick in SIMD.

    For example just to throw something wild out there, gf2p8affineqb(0x8040201008040201, x) would put all the "discarded" bits in one place (namely the top byte of the result) and makes a solid 56-bit chunk out of the bits that we want to keep. But the bits do end up in a strange order (the first byte would contain bits 56, 48, 40, 32, 24, 16, 8, 0, in that order, listing the least significant bit first).

    That order, strange as it is, can be easily unpacked using pshufb to reverse the bytes (you can also use this to insert the two zeroes) and then gf2p8affineqb(0x0102040810204080, reversedBytes) shuffles the bits back into the original order.

    Here's a sketch of how that could work with actual AVX2+GFNI intrinsics. I'm not bothering to handle the extra parts at the end here, just the "main" loop, so the input text had better be a multiple of 32 bytes. Works on my PC ✔️

    void compress8x7bit(const char* ascii, size_t len, uint8_t* bin)
    {
        const char* end = ascii + len;
        while (ascii + 31 < end) {
            __m256i text = _mm256_loadu_si256((__m256i*)ascii);
            __m256i transposed = _mm256_gf2p8affine_epi64_epi8(_mm256_set1_epi64x(0x8040201008040201), text, 0);
            __m256i compressed = _mm256_shuffle_epi8(transposed, 
                _mm256_set_epi8(-1, -1, 14, 13, 12, 11, 10, 9, 8, 6, 5, 4, 3, 2, 1, 0,
                                -1, -1, 14, 13, 12, 11, 10, 9, 8, 6, 5, 4, 3, 2, 1, 0));
            _mm_storeu_si128((__m128i*)bin, _mm256_castsi256_si128(compressed));
            _mm_storeu_si128((__m128i*)(bin + 14), _mm256_extracti128_si256(compressed, 1));
            bin += 28;
            ascii += 32;
        }
    }
    
    void uncompress8x7bit(char* ascii, size_t len, const uint8_t* bin)
    {
        const char* end = ascii + len;
        while (ascii + 31 < end) {
            __m256i raw = _mm256_inserti128_si256(_mm256_castsi128_si256(_mm_loadu_si128((__m128i*)bin)), _mm_loadu_si128((__m128i*)(bin + 14)), 1);
            __m256i rev_with_zeroes = _mm256_shuffle_epi8(raw, 
                _mm256_set_epi8(7, 8, 9, 10, 11, 12, 13, -1, 0, 1, 2, 3, 4, 5, 6, -1,
                                7, 8, 9, 10, 11, 12, 13, -1, 0, 1, 2, 3, 4, 5, 6, -1));
            __m256i decompressed = _mm256_gf2p8affine_epi64_epi8(_mm256_set1_epi64x(0x0102040810204080), rev_with_zeroes, 0);
            _mm256_storeu_si256((__m256i*)ascii, decompressed);
            bin += 28;
            ascii += 32;
        }
    }
    

    Perhaps there is a nicer solution than using two 128-bit stores in the compressor and two 128-bit loads in the uncompressor. With AVX512 that would be easy since it has full-register byte-granular permutes, but AVX2 has vpshufb, which is not able to move bytes between the two 128-bit halves that make up a 256-bit vector. The uncompressor could do a funny load that starts 2 bytes before the start of the data it wants, like this: _mm256_loadu_si256((__m256i*)(bin - 2)) (and a slightly different shuffle vector), at the cost of having to avoid a potential out-of-bounds error with either padding or a special first iteration, but the compressor cannot (not cheaply) use a trick like that with a store that start 2 bytes earlier (that would destroy two bytes of the result).

    By the way I have some test code here that you can use to verify that your bit-compression functions do the right thing (well sort of - as long as the function is a bit-permutation where some of the bits may be zeroed this works as a check, but this would not detect every possible bug in general):

    uint64_t bitindex[7];
    bitindex[6] = compress8x7bit(0xFFFFFFFFFFFFFFFF);
    bitindex[5] = compress8x7bit(0xFFFFFFFF00000000);
    bitindex[4] = compress8x7bit(0xFFFF0000FFFF0000);
    bitindex[3] = compress8x7bit(0xFF00FF00FF00FF00);
    bitindex[2] = compress8x7bit(0xF0F0F0F0F0F0F0F0);
    bitindex[1] = compress8x7bit(0xCCCCCCCCCCCCCCCC);
    bitindex[0] = compress8x7bit(0xAAAAAAAAAAAAAAAA);
    
    for (size_t i = 0; i < 64; i++)
    {
        if (i != 0)
            std::cout << ", ";
        if (bitindex[6] & (1uLL << i))
        {
            int index = 0;
            for (size_t j = 0; j < 6; j++)
            {
                if (bitindex[j] & (1uLL << i))
                    index |= 1 << j;
            }
            std::cout << index;
        }
        else
            std::cout << "_";
    }
    std::cout << "\n";