Tags: c, vectorization, simd

How to vectorize using Intel intrinsics in the C programming language?


I want to vectorize a C program.

I searched the internet and YouTube but found very little. What I did find was not helpful for a beginner like me, and most of it was about C++. From the little I understood, I have to use compiler intrinsics (which can be found in the Intel Intrinsics Guide). I have an old machine that supports the SSE 4.1 and SSE 4.2 instruction sets.

But I cannot move forward with the little knowledge I have, so my question is: how can I vectorize a C program?

As a demonstration, can you show how to optimize the following code:

#include <math.h>  // pow

float function(float* Array, int Initial, int Finishing_point)
{
    int k = 0;
    float Vl = 0;
    for (int i = Initial; i < Finishing_point; i++)
    {
        k++;
        Vl = Vl + Array[i] * pow(2, k);  // the k-th element is weighted by 2^k
    }
    return Vl;
}

Please note that I need an introductory example, so I am using one that involves only summation, array operations, and other simple programming.


Solution

  • Here’s a manually vectorized version of the function. It requires SSE1 and SSE3, both of which are available on any CPU that supports SSE 4.1.

    #include <stddef.h>     // size_t
    #include <xmmintrin.h>  // SSE 1
    #include <pmmintrin.h>  // SSE 3
    
    float computeThings( const float* rsi, int idxFirst, int idxEnd )
    {
        // Figure out the slice of the input array to consume
        size_t count = (size_t)( idxEnd - idxFirst );
        size_t countAligned = ( count / 4 ) * 4;
        rsi += idxFirst;
        const float* endAligned = rsi + countAligned;
        const float* end = rsi + count;
    
        // Process majority of inputs with SSE
        __m128 acc = _mm_setzero_ps();
        __m128 kexp = _mm_setr_ps( 2, 4, 8, 16 );
        for( ; rsi < endAligned; rsi += 4 )
        {
            __m128 v = _mm_loadu_ps( rsi );
            v = _mm_mul_ps( v, kexp );
            kexp = _mm_mul_ps( kexp, _mm_set1_ps( 16 ) );
            acc = _mm_add_ps( acc, v );
        }
    
        // Compute horizontal sum of the `acc` vector
    
        // acc.xyzw += acc.yyww
        acc = _mm_add_ps( acc, _mm_movehdup_ps( acc ) );
        // acc.x += acc.z
        acc = _mm_add_ss( acc, _mm_unpackhi_ps( acc, acc ) );
    
        // Process the remaining 0-3 numbers
        for( ; rsi < end; rsi++ )
        {
            __m128 v = _mm_load_ss( rsi );
            v = _mm_mul_ss( v, kexp );
            // kexp.x *= 2, computed as kexp.x += kexp.x
            kexp = _mm_add_ss( kexp, kexp );
            acc = _mm_add_ss( acc, v );
        }
    
        return _mm_cvtss_f32( acc );
    }
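
The horizontal-sum step in the function above is the part beginners most often find opaque, so here is a sketch of the same idea in isolation. Note that this is a swapped-in variant, not the answer's exact code: it uses only SSE1 shuffles (`_mm_shuffle_ps` and `_mm_movehl_ps`) instead of the SSE3 `_mm_movehdup_ps`, which may be useful if SSE3 cannot be assumed. The helper name `hsum_sse1` is hypothetical.

```c
#include <xmmintrin.h>  // SSE 1 only

// Hypothetical helper (not part of the original answer):
// returns x + y + z + w for a vector v = ( x, y, z, w ), where x is lane 0.
static float hsum_sse1( __m128 v )
{
    // Swap neighbours within each pair: shuf = ( y, x, w, z )
    __m128 shuf = _mm_shuffle_ps( v, v, _MM_SHUFFLE( 2, 3, 0, 1 ) );
    // Pairwise sums: sums = ( x+y, x+y, z+w, z+w )
    __m128 sums = _mm_add_ps( v, shuf );
    // Move the upper half of sums into the lower half: shuf = ( z+w, z+w, w, z )
    shuf = _mm_movehl_ps( shuf, sums );
    // Add lane 0 only: lane 0 becomes ( x+y ) + ( z+w )
    sums = _mm_add_ss( sums, shuf );
    // Extract lane 0 as a scalar float
    return _mm_cvtss_f32( sums );
}
```

For example, `hsum_sse1( _mm_setr_ps( 1, 2, 3, 4 ) )` evaluates to 10. The SSE3 sequence in the function above reaches the same lane-0 result with one shuffle fewer in the first step.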
    

    Usage example:

    // Needs #include <stdio.h> for printf
    float A[] = { 1,2,3,4,5,6,7,8 };
    printf( "%g\n", computeThings( A, 0, 8 ) );  // prints 3586
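
When vectorizing by hand, it helps to keep a plain scalar version around to check results against. The following is a minimal sketch, assuming the same semantics as the question's loop. It replaces `pow( 2, k )` with a running multiplier, which is the same trick the `kexp` vector uses in the SSE version. The name `computeThingsScalar` is hypothetical.

```c
// Hypothetical scalar reference (not part of the original answer).
// Computes arr[idxFirst] * 2^1 + arr[idxFirst + 1] * 2^2 + ...
// for every index in [ idxFirst, idxEnd ).
static float computeThingsScalar( const float* arr, int idxFirst, int idxEnd )
{
    float sum = 0.0f;
    float weight = 2.0f;  // 2^1, the weight of the first element
    for( int i = idxFirst; i < idxEnd; i++ )
    {
        sum += arr[ i ] * weight;
        weight *= 2.0f;   // advance to the next power of two
    }
    return sum;
}
```

For the array `{ 1,2,3,4,5,6,7,8 }` with indices 0 to 8 this returns 3586. On other inputs the scalar and SIMD results may differ in the last few bits, because the SIMD version adds the terms in a different order and floating-point addition is not associative, so a comparison with a small tolerance is usually the safer check.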