Tags: c#, c, assembly, masm, simd

Dot-product groups of 4 bytes against 4 small constants, over an array of bytes (efficiently using SIMD)?


I have a peculiar requirement that needs to be fulfilled efficiently. (SIMD, perhaps?)

src is an array of bytes. Every group of 4 bytes in the array needs to be processed as follows:

  • Multiply low nibble of src[0] by a constant number A.
  • Multiply low nibble of src[1] by a constant number B.
  • Multiply low nibble of src[2] by a constant number C.
  • Multiply low nibble of src[3] by a constant number D.

Sum the four parts above to give result.

Move on to the next set of 4 bytes and re-compute result (rinse and repeat until the end of the byte array).

result is guaranteed to be small (it even fits in a byte) owing to all numbers involved being very small. However, the data type for result can be flexible to support an efficient algorithm.

Any suggestions / tips / tricks to go faster than the following pseudo-code?

for (int i=0; i< length; i+=4)
{
  result = (src[i] & 0x0f) * A + (src[i+1] & 0x0f) * B + (src[i+2] & 0x0f) * C + (src[i+3] & 0x0f) * D;
}

BTW, result then forms an index into a higher-order array.

This particular loop is so crucial that the implementation language is no barrier; I can choose any of C#, C, or MASM64.


Solution

  • Here’s an example of how to do that efficiently with SSE intrinsics.

    #include <stdint.h>
    #include <emmintrin.h>  // SSE 2
    #include <tmmintrin.h>  // SSSE 3
    #include <smmintrin.h>  // SSE 4.1
    
    // Vector constants for dot4Sse function
    struct ConstantVectorsSse
    {
        __m128i abcd;
        __m128i lowNibbleMask;
        __m128i zero;
    };
    
    // Pack 4 bytes into a single uint32_t value
    uint32_t packBytes( uint32_t a, uint32_t b, uint32_t c, uint32_t d )
    {
        b <<= 8;
        c <<= 16;
        d <<= 24;
        return a | b | c | d;
    }
    
    // Initialize vector constants for dot4Sse function
    struct ConstantVectorsSse makeConstantsSse( uint8_t a, uint8_t b, uint8_t c, uint8_t d )
    {
        struct ConstantVectorsSse cv;
        cv.abcd = _mm_set1_epi32( (int)packBytes( a, b, c, d ) );
        cv.lowNibbleMask = _mm_set1_epi8( 0x0F );
        cv.zero = _mm_setzero_si128();
        return cv;
    }
    
    // Dot products of 4 groups of 4 bytes in memory against 4 small constants
    // Returns a vector of 4 int32 lanes
    __m128i dot4Sse( const uint8_t* rsi, const struct ConstantVectorsSse* cv )
    {
        // Load 16 bytes, and mask away higher 4 bits in each byte
        __m128i v = _mm_loadu_si128( ( const __m128i* )rsi );
        v = _mm_and_si128( cv->lowNibbleMask, v );
    
        // Compute products, add pairwise
        v = _mm_maddubs_epi16( cv->abcd, v );
    
        // Final reduction step, add adjacent pairs of uint16_t lanes
        __m128i high = _mm_srli_epi32( v, 16 );
        __m128i low = _mm_blend_epi16( v, cv->zero, 0b10101010 );
        return _mm_add_epi32( high, low );
    }
    

    The code uses the pmaddubsw SSSE3 instruction for the multiplication and the first step of the reduction, then adds the even/odd uint16_t lanes in the vector.
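
    For reference, here is a scalar sketch of what each 32-bit lane of the vector returned by dot4Sse ends up holding. dot4Scalar is a hypothetical helper, handy for unit-testing the SIMD version:

    // Scalar reference for lane g (0..3) of dot4Sse's result (hypothetical helper for testing)
    uint32_t dot4Scalar( const uint8_t* rsi, uint8_t a, uint8_t b, uint8_t c, uint8_t d, int g )
    {
        const uint8_t* p = rsi + g * 4;
        return ( p[0] & 0x0F ) * a + ( p[1] & 0x0F ) * b
             + ( p[2] & 0x0F ) * c + ( p[3] & 0x0F ) * d;
    }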

    The above code assumes your ABCD numbers are unsigned bytes. If they are signed, you will need to flip the order of the arguments of the _mm_maddubs_epi16 intrinsic and use different code for the second reduction step: _mm_slli_epi32( v, 16 ), _mm_add_epi16, _mm_srai_epi32( v, 16 ). See the sketch below.
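
    A minimal sketch of that signed variant, assuming cv->abcd was packed from the signed byte values (untested, same structure as dot4Sse above):

    // Signed-constants variant (sketch): the nibbles go in the unsigned operand of pmaddubsw,
    // the signed constants in the signed operand; the reduction sign-extends the pair sums
    __m128i dot4SseSigned( const uint8_t* rsi, const struct ConstantVectorsSse* cv )
    {
        __m128i v = _mm_loadu_si128( ( const __m128i* )rsi );
        v = _mm_and_si128( cv->lowNibbleMask, v );

        // Products of unsigned nibbles by signed constants, added pairwise into int16 lanes
        v = _mm_maddubs_epi16( v, cv->abcd );

        // Add the two int16 halves of each 32-bit lane, then sign-extend the sum
        __m128i t = _mm_slli_epi32( v, 16 );   // low int16 of each pair -> high half
        v = _mm_add_epi16( v, t );             // high half now holds low + high
        return _mm_srai_epi32( v, 16 );        // arithmetic shift sign-extends to int32
    }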

    If you have AVX2, the upgrade is trivial: replace __m128i with __m256i, and _mm_something with _mm256_something.
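
    For illustration, here is a sketch of such an AVX2 version that handles 8 groups per call. dot8Avx2 is a hypothetical name; it assumes the constants were broadcast with _mm256_set1_epi32 / _mm256_set1_epi8 and requires <immintrin.h>:

    // AVX2 sketch: same algorithm as dot4Sse, 32 bytes = 8 groups per iteration
    __m256i dot8Avx2( const uint8_t* rsi, __m256i abcd, __m256i lowNibbleMask )
    {
        __m256i v = _mm256_loadu_si256( ( const __m256i* )rsi );
        v = _mm256_and_si256( lowNibbleMask, v );

        // Products + first reduction step, as in the SSE version
        v = _mm256_maddubs_epi16( abcd, v );

        // Add even/odd uint16_t lanes into 8 int32 results
        __m256i high = _mm256_srli_epi32( v, 16 );
        __m256i low = _mm256_blend_epi16( v, _mm256_setzero_si256(), 0b10101010 );
        return _mm256_add_epi32( high, low );
    }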

    If the length of your input is not necessarily a multiple of 16 bytes (4 groups), note that you will need special handling for the final incomplete batch of numbers. Without the _mm_maskload_epi32 AVX2 instruction, here's one possible way to load an incomplete vector of 4-byte groups:

    #include <assert.h>

    // Load the final 1, 2 or 3 groups of 4 bytes; the unused higher lanes stay zero
    __m128i loadPartial( const uint8_t* rsi, size_t rem )
    {
        assert( rem != 0 && rem < 4 );
        __m128i v = _mm_setzero_si128();
        switch( rem )
        {
        case 1:
            // 1 group = 4 bytes
            v = _mm_cvtsi32_si128( *(const int*)( rsi ) );
            break;
        case 2:
            // 2 groups = 8 bytes
            v = _mm_cvtsi64_si128( *(const int64_t*)( rsi ) );
            break;
        case 3:
            // 3 groups = 8 bytes, plus 4 more bytes inserted into the third lane
            v = _mm_cvtsi64_si128( *(const int64_t*)( rsi ) );
            v = _mm_insert_epi32( v, *(const int*)( rsi + 8 ), 2 );
            break;
        }
        return v;
    }
    

    P.S. Since you are then going to use these integers as indices, note that it only takes a few cycles of latency to extract integers from SSE vectors with the _mm_extract_epi32 instruction.
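
    For illustration, a sketch of how the main loop might consume these vectors; src, dst and table are hypothetical names (table stands for the higher-order array mentioned in the question, and its element type is only a guess):

    // Process whole 16-byte blocks; each dot4Sse call yields 4 indices into `table`
    void processAll( const uint8_t* src, size_t length, const uint8_t* table,
        uint8_t* dst, const struct ConstantVectorsSse* cv )
    {
        size_t i = 0;
        for( ; i + 16 <= length; i += 16, dst += 4 )
        {
            __m128i idx = dot4Sse( src + i, cv );
            dst[0] = table[ _mm_cvtsi128_si32( idx ) ];      // lane 0, SSE 2
            dst[1] = table[ _mm_extract_epi32( idx, 1 ) ];   // lanes 1..3, SSE 4.1
            dst[2] = table[ _mm_extract_epi32( idx, 2 ) ];
            dst[3] = table[ _mm_extract_epi32( idx, 3 ) ];
        }
        // The remaining 1..3 groups, if any, would go through loadPartial() the same way
    }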