How do I vectorize C code using Intel intrinsics?

I want to vectorize a C program.

I searched the internet and YouTube but found very little. What I found was not helpful for a beginner like me, and most of it was about C++. From what I understood, I have to use compiler intrinsics, which are documented in the Intel Intrinsics Guide. I have an old machine that supports the SSE 4.1 and SSE 4.2 instruction sets.

I cannot get any further with the little I know, so my question is: how can I vectorize a C program?

As a demonstration, can you show how to optimize the following code:

#include <math.h>  // pow

float function(float* Array, int Initial, int Finishing_point)
{
    int k = 0;
    float Vl = 0;
    for (int i = Initial; i < Finishing_point; i++)
    {
        k++;
        Vl = Vl + Array[i] * pow(2, k);
    }    
    return Vl;
}

Please note that I need an introductory example, so I chose one that involves only summation, array access, and other simple operations.

Solution

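Here is a manually vectorized version of the function. It requires SSE1 and SSE3, both of which are available on any CPU that supports SSE 4.1; with GCC or Clang, build with -msse3 or higher. The main loop handles four floats per iteration: the vector kexp holds four consecutive powers of two and is multiplied by 16 after each step, and acc collects four partial sums that are added together at the end. A scalar tail loop handles the remaining 0-3 elements.

Unlike the original, which goes through double because of pow, all arithmetic here is single precision. Results can differ in the last bits, and because 2^128 does not fit in a float, slices of 128 or more elements can produce infinity or NaN where the original would not.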

#include <stddef.h>     // size_t
#include <xmmintrin.h>  // SSE 1
#include <pmmintrin.h>  // SSE 3

float computeThings( const float* rsi, int idxFirst, int idxEnd )
{
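    // An empty or reversed range contributes nothing, same as the scalar loop
    if( idxEnd <= idxFirst )
        return 0.0f;

    // Figure out the slice of the input array to consume
    size_t count = (size_t)( idxEnd - idxFirst );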
    size_t countAligned = ( count / 4 ) * 4;
    rsi += idxFirst;
    const float* endAligned = rsi + countAligned;
    const float* end = rsi + count;

    // Process majority of inputs with SSE
    __m128 acc = _mm_setzero_ps();
    __m128 kexp = _mm_setr_ps( 2, 4, 8, 16 );
    for( ; rsi < endAligned; rsi += 4 )
    {
        __m128 v = _mm_loadu_ps( rsi );
        v = _mm_mul_ps( v, kexp );
        kexp = _mm_mul_ps( kexp, _mm_set1_ps( 16 ) );
        acc = _mm_add_ps( acc, v );
    }

    // Compute horizontal sum of the `acc` vector

    // acc.xyzw += acc.yyww
    acc = _mm_add_ps( acc, _mm_movehdup_ps( acc ) );
    // acc.x += acc.z
    acc = _mm_add_ss( acc, _mm_unpackhi_ps( acc, acc ) );

    // Process the remaining 0-3 numbers
    for( ; rsi < end; rsi++ )
    {
        __m128 v = _mm_load_ss( rsi );
        v = _mm_mul_ss( v, kexp );
        // kexp.x *= 2, computed as kexp.x += kexp.x
        kexp = _mm_add_ss( kexp, kexp );
        acc = _mm_add_ss( acc, v );
    }

    return _mm_cvtss_f32( acc );
}
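
To see what the SSE loop is doing without reading intrinsics, it can help to model the four lanes with plain arrays. The sketch below is only an illustration under that assumption: `laneModel` is a hypothetical name that does not appear in the function above, and the arrays `acc` and `kexp` stand in for the two `__m128` registers.

```c
#include <stddef.h>

// Plain-C model of the SSE version: acc[] and kexp[] stand in for the
// two __m128 registers. Hypothetical helper, for illustration only.
float laneModel( const float* src, int idxFirst, int idxEnd )
{
    if( idxEnd <= idxFirst )
        return 0.0f;

    size_t count = (size_t)( idxEnd - idxFirst );
    size_t countAligned = ( count / 4 ) * 4;
    src += idxFirst;

    float acc[ 4 ] = { 0, 0, 0, 0 };
    float kexp[ 4 ] = { 2, 4, 8, 16 };  // 2^1 .. 2^4

    // Main loop: one outer iteration corresponds to one pass of the SSE loop
    size_t i = 0;
    for( ; i < countAligned; i += 4 )
    {
        for( int lane = 0; lane < 4; lane++ )
        {
            acc[ lane ] += src[ i + lane ] * kexp[ lane ];
            // Each lane skips ahead four powers of two: 2^4 = 16
            kexp[ lane ] *= 16;
        }
    }

    // Horizontal sum, in the same order as the two SSE additions
    float sum = ( acc[ 0 ] + acc[ 1 ] ) + ( acc[ 2 ] + acc[ 3 ] );

    // Tail: 0-3 leftover elements, lane 0 of kexp is the running power
    for( ; i < count; i++ )
    {
        sum += src[ i ] * kexp[ 0 ];
        kexp[ 0 ] += kexp[ 0 ];
    }
    return sum;
}
```

Each lane only ever sees every fourth element, which is why its multiplier advances by a factor of 16 rather than 2. The intrinsics do the same work, except that the inner four-lane loop becomes a single instruction.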

Usage example:

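// Inside main(), with <stdio.h> included:
float A[] = { 1, 2, 3, 4, 5, 6, 7, 8 };
// Prints 3586, i.e. 1*2 + 2*4 + 3*8 + ... + 8*256
printf( "%g\n", computeThings( A, 0, 8 ) );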
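
When vectorizing by hand, it helps to keep a simple scalar version around and compare the two on small inputs. A minimal sketch, assuming a hypothetical helper named `referenceSum` that mirrors the loop from the question but replaces `pow( 2, k )` with a running power of two:

```c
// Scalar reference mirroring the loop in the question. Hypothetical helper,
// not part of the answer above; pow( 2, k ) is replaced by a running product.
float referenceSum( const float* src, int idxFirst, int idxEnd )
{
    float sum = 0.0f;
    double power = 1.0;  // 2^k, kept in double as pow() would return it
    for( int i = idxFirst; i < idxEnd; i++ )
    {
        power *= 2.0;  // corresponds to k++ in the original
        sum = (float)( sum + src[ i ] * power );
    }
    return sum;
}
```

For the eight-element array in the usage example, `referenceSum( A, 0, 8 )` returns 3586, the same value the SSE version computes. On longer inputs with arbitrary values the two may differ in the last bits because the summation order is different, so a comparison with a small tolerance is safer than `==`.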