Tags: sse, simd, avx, avx2, avx512

SIMD: Bit-pack signed integers


Unsigned integers can be compressed using "bit-packing" techniques: within a block of unsigned integers, only the significant bits are stored, resulting in data compression when all integers in the block are "small". The method is known as FOR (frame of reference).

There are SIMD libraries that do this very efficiently.
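
For illustration, a minimal scalar sketch of the FOR idea (the function name, the block size of 4, and the single-word output are my own choices, not taken from any particular library):

    #include <stdint.h>
    
    // FOR sketch: subtract the block minimum (the "frame of reference"),
    // then store each residual in `bits` bits of one word.
    // Assumes every residual fits in `bits` bits and 4*bits <= 64.
    static uint64_t pack4_for(const uint32_t in[4], uint32_t base, unsigned bits) {
        uint64_t w = 0;
        for (int i = 0; i < 4; i++)
            w |= (uint64_t)(in[i] - base) << (i * bits);
        return w;
    }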

Now I want to use FOR-like techniques to encode signed integers, e.g. from a differenced sequence of unsorted unsigned integers. The sign of each signed integer needs to be stored somewhere; there are two options:

  1. Store the signs in a separate block of data. This adds overhead.
  2. Store the sign together with the absolute value of each signed integer.

I'm following path 2 right now. Two's complement has the sign bit in the msb (most significant bit), so that won't work for bit-packing à la FOR. One possibility is to store the sign in the lsb (least significant bit) instead. Storing signed integers this way is very unusual; as far as I know, there are no instructions that support it. The question now is: can these lsb-signed integers be encoded/decoded efficiently using SIMD instructions?
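
(For reference, the scalar form of such an lsb-sign mapping is ZigZag encoding, as used by Protocol Buffers; for negative v it stores -v-1 rather than the absolute value in the upper bits, which avoids a redundant negative zero:)

    #include <stdint.h>
    
    // Scalar ZigZag: 0,-1,1,-2,2,... -> 0,1,2,3,4,...  (sign lands in the lsb)
    static inline uint32_t zigzag_encode32(int32_t v) {
        return ((uint32_t)v << 1) ^ (uint32_t)(v >> 31); // v >> 31: arithmetic shift
    }
    static inline int32_t zigzag_decode32(uint32_t u) {
        return (int32_t)(u >> 1) ^ -(int32_t)(u & 1);
    }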

I think AVX-512 _mm_testn_epi32_mask can be used to extract the lsb from each uint32, followed by a shift, then two mask_extract operations of some sort? Quite convoluted.
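
An untested sketch of that idea for the sign-plus-absolute-value layout (assuming AVX-512F + AVX-512VL; I use _mm_test_epi32_mask, the complement of _mm_testn_epi32_mask, so the mask marks lanes whose lsb is set, and a masked subtract instead of mask_extract):

    #include <immintrin.h>
    
    // Decode sign-in-lsb integers: value = (u & 1) ? -(u >> 1) : (u >> 1)
    __m128i lsbsign_decode_epi32(__m128i u) {
        __mmask8 neg = _mm_test_epi32_mask(u, _mm_set1_epi32(1)); // lanes with lsb == 1
        __m128i mag = _mm_srli_epi32(u, 1);                       // absolute value
        return _mm_mask_sub_epi32(mag, neg, _mm_setzero_si128(), mag); // negate marked lanes
    }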


Solution

  • Untested ZigZag examples in C using SSE2 for 64-bit integers:

    (note: SSE2 is missing some 64-bit instructions, e.g. there is no 64-bit arithmetic right shift and no 64-bit compare, hence the 32-bit shuffle/shift workaround below...)

    #include <emmintrin.h>
    
    
    // from comment by Peter-Cordes 
    __m128i zigzag_encode_epi64(__m128i v) {
        __m128i signmask = _mm_shuffle_epi32(v, _MM_SHUFFLE(3,3,1,1)); // copy the high dword of each qword into both halves
        signmask = _mm_srai_epi32(signmask, 31);                       // broadcast the sign bit across each 64-bit lane
        return _mm_xor_si128(_mm_add_epi64(v, v), signmask);           // (v << 1) ^ (v >> 63)
    }
    
    
    __m128i zigzag_decode_epi64 (__m128i v) {
        __m128i signmask = _mm_and_si128(_mm_set_epi32(0, 1, 0, 1), v); // isolate the lsb of each qword
        signmask = _mm_sub_epi64(_mm_setzero_si128(), signmask);        // 0 or all-ones per lane: -(v & 1)
        return _mm_xor_si128(_mm_srli_epi64(v, 1), signmask);           // (v >> 1) ^ -(v & 1)
    }
    
    
    // variant without a constant: recover the sign mask from v itself
    __m128i zigzag_decodev3_epi64 (__m128i v) {
        __m128i t = _mm_srli_epi64(v, 1);
        __m128i signmask = _mm_sub_epi64(_mm_slli_epi64(t, 1), v); // (t << 1) - v == -(v & 1)
        return _mm_xor_si128(t, signmask);
    }
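
    A quick (equally untested) round-trip check, just to show the intended usage; main and the sample values are mine:

    #include <stdio.h>
    #include <stdint.h>
    
    int main(void) {
        int64_t in[2] = { -123456789012345LL, 42 }, out[2];
        __m128i v = _mm_loadu_si128((const __m128i *)in);
        _mm_storeu_si128((__m128i *)out, zigzag_decode_epi64(zigzag_encode_epi64(v)));
        printf("%lld %lld\n", (long long)out[0], (long long)out[1]); // expect: -123456789012345 42
        return 0;
    }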
    

    Zigzag is good for bitwise varints. However, a bytewise group-varint may wish to "sign extend from a variable bit-width".
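
    For that case the usual scalar idiom is the xor/subtract trick (a sketch; `n` is the stored bit-width, and the function name is mine):

    // Sign-extend the low n bits of x (bits above n assumed zero)
    static inline int32_t sign_extend(uint32_t x, unsigned n) {
        uint32_t m = 1u << (n - 1);    // position of the field's sign bit
        return (int32_t)((x ^ m) - m);
    }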


    32-bit examples

    I favored compares over arithmetic shifts. I assume that, when unrolled, compares will have one cycle lower latency.

    __m128i zigzag_encode_epi32 (__m128i v) {
        __m128i signmask =_mm_cmpgt_epi32(_mm_setzero_si128(), v);
        return _mm_xor_si128(_mm_add_epi32(v, v), signmask);
    }
    
    
    __m128i zigzag_decode_epi32 (__m128i v) {
        const __m128i m = _mm_set1_epi32(1);
        __m128i signmask =_mm_cmpeq_epi32(_mm_and_si128(m, v), m);
        return _mm_xor_si128(_mm_srli_epi32(v, 1), signmask);
    }
    
    
    // note: _mm_alignr_epi8 requires SSSE3 (#include <tmmintrin.h>)
    __m128i delta_encode_epi32 (__m128i v, __m128i prev) {
        return _mm_sub_epi32(v, _mm_alignr_epi8(v, prev, 12)); // v - [prev3 v0 v1 v2]
    }
    
    
    // prefix sum (see many answers around stackoverflow...)
    __m128i delta_decode_epi32 (__m128i v, __m128i prev) {
        prev = _mm_shuffle_epi32(prev, _MM_SHUFFLE(3,3,3,3)); // [P P P P]
        v = _mm_add_epi32(v, _mm_slli_si128(v, 4)); // [A AB BC CD]
        prev = _mm_add_epi32(prev, v); // [PA PAB PBC PCD]
        v = _mm_slli_si128(v, 8); // [0 0 A AB]
        return _mm_add_epi32(prev, v); // [PA PAB PABC PABCD]
    }
    
    
    __m128i delta_zigzag_encode_epi32 (__m128i v, __m128i prev) {
        return zigzag_encode_epi32(delta_encode_epi32(v, prev));
    }
    
    
    __m128i delta_zigzag_decode_epi32 (__m128i v, __m128i prev) {
        return delta_decode_epi32(zigzag_decode_epi32(v), prev);
    }
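
    A sketch of how prev is meant to be carried across vectors when decoding a whole buffer in place (my own untested wrapper; n must be a multiple of 4):

    #include <stddef.h>
    #include <stdint.h>
    
    void delta_zigzag_decode_buf(int32_t *data, size_t n) {
        __m128i prev = _mm_setzero_si128();
        for (size_t i = 0; i < n; i += 4) {
            __m128i v = _mm_loadu_si128((const __m128i *)(data + i));
            prev = delta_zigzag_decode_epi32(v, prev); // decoded values double as the next prev
            _mm_storeu_si128((__m128i *)(data + i), prev);
        }
    }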
    

    Note: for delta coding it would be faster (for the round trip, and especially decoding) to transpose the elements while encoding and transpose them back during decoding, since horizontal prefix sums are really slow. However, determining the optimum number of elements to transpose in each batch seems like a hard problem. A sketch of the "vertical" decode under that layout follows.
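
    To make the transposed layout concrete: with four independent streams interleaved one per lane, the decode collapses to a single vertical add per vector (a sketch under that layout assumption; the function name is mine):

    // Vertical delta decode: `running` holds the last decoded value of each
    // of the four transposed streams; no horizontal shuffles are needed.
    __m128i delta_decode_vertical_epi32(__m128i v, __m128i running) {
        return _mm_add_epi32(running, v); // new running totals == decoded values
    }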