Tags: gcc, openmp, auto-vectorization

gcc auto-vectorisation (unhandled data-ref)


I do not understand why the following code is not vectorized by gcc 4.4.6:

int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
  for (int i = 0; i < iSize; i++)
     pfResult[i] = pfResult[i] + pfTab[iIndex];
}

 note: not vectorized: unhandled data-ref

However, if I write the following code:

int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
  float fTab =  pfTab[iIndex];
  for (int i = 0; i < iSize; i++)
     pfResult[i] = pfResult[i] + fTab;
}

gcc succeeds in auto-vectorizing this loop.

If I add an OpenMP directive:

int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
  float fTab =  pfTab[iIndex];
  #pragma omp parallel for
  for (int i = 0; i < iSize; i++)
     pfResult[i] = pfResult[i] + fTab;
}

I get the same note again: not vectorized: unhandled data-ref

Could you please help me understand why the first and third versions are not auto-vectorized?

Second question: math functions (exp, log, etc.) do not seem to be vectorized. For example, this loop

for (int i = 0; i < iSize; i++)
         pfResult[i] = exp(pfResult[i]);

is not vectorized. Is this due to my version of gcc?

Edit: with a newer gcc (4.8.1) and OpenMP 3.1 from 2011 (checked with echo | cpp -fopenmp -dM | grep -i open), I get the following notes for every kind of loop, even one as basic as:

   for (iGID = 0; iGID < iSize; iGID++)
        {
             pfResult[iGID] = fValue;
        }


note: not consecutive access *_144 = 5.0e-1;
note: Failed to SLP the basic block.
note: not vectorized: failed to find SLP opportunities in basic block.

Edit2:

#include <stdio.h>
#include <sys/time.h>
#include <string.h>
#include <math.h>
#include <stdlib.h>
#include <unistd.h>   /* getpagesize() */
#include <omp.h>

int main()
{
        int szGlobalWorkSize = 131072;
        int iGID = 0;
        int j = 0;
        omp_set_dynamic(0);
        // warmup
        #if WARMUP
        #pragma omp parallel
        {
        #pragma omp master
        {
        printf("%d threads\n", omp_get_num_threads());
        }
        }
        #endif
        printf("Pagesize=%d\n", getpagesize());
        float *pfResult = (float *)malloc(szGlobalWorkSize * 100* sizeof(float));
        float fValue = 0.5f;
        struct timeval tim;
        gettimeofday(&tim, NULL);
        double tLaunch1=tim.tv_sec+(tim.tv_usec/1000000.0);
        double time = omp_get_wtime();
        int iChunk = getpagesize();
        int iSize = ((int)szGlobalWorkSize * 100) / iChunk;
        //#pragma omp parallel for
        for (iGID = 0; iGID < iSize; iGID++)
        {
             pfResult[iGID] = fValue;
        }
        time = omp_get_wtime() - time;
        gettimeofday(&tim, NULL);
        double tLaunch2=tim.tv_sec+(tim.tv_usec/1000000.0);
        printf("%.6lf Time1\n", tLaunch2-tLaunch1);
        printf("%.6lf Time2\n", time);
}

Result with:

#define _OPENMP 201107
gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-15)

gcc -march=native -fopenmp -O3 -ftree-vectorizer-verbose=2 test.c -lm

I get a lot of:

note: Failed to SLP the basic block.
note: not vectorized: failed to find SLP opportunities in basic block.
note: not consecutive access *_144 = 5.0e-1;

Thanks


Solution

  • GCC cannot vectorise the first version of your loop because it cannot prove that pfTab[iIndex] does not lie somewhere within the memory spanned by pfResult[0] ... pfResult[iSize-1] (pointer aliasing). Indeed, if pfTab[iIndex] is within that memory, the assignment in the loop body overwrites it at some iteration and the new value must be used in the iterations that follow. Use the restrict keyword to tell the compiler that this can never happen, and it should then happily vectorise your code:

    $ cat foo.c
    int MyFunc(const float *restrict pfTab, float *restrict pfResult,
               int iSize, int iIndex)
    {
       for (int i = 0; i < iSize; i++)
         pfResult[i] = pfResult[i] + pfTab[iIndex];
    }
    $ gcc -v
    ...
    gcc version 4.6.1 (GCC)
    $ gcc -std=c99 -O3 -march=native -ftree-vectorizer-verbose=2 -c foo.c
    foo.c:3: note: LOOP VECTORIZED.
    foo.c:1: note: vectorized 1 loops in function.
    

    The second version vectorises because the value is copied into a variable with automatic storage duration before the loop starts. The general assumption here is that pfResult does not overlap the stack memory where fTab is stored (a cursory read through the C99 language specification doesn't make it clear whether that assumption is weak or whether something in the standard allows it).
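
    To see why the compiler may not simply turn the first version into the second on its own, here is a minimal sketch that calls both with deliberately overlapping arguments. The names add_reload, add_hoisted and alias_demo are made up for this illustration; the loop bodies are the ones from the question.

```c
#include <assert.h>

/* First version from the question: pfTab[iIndex] is re-read on every iteration. */
void add_reload(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
    for (int i = 0; i < iSize; i++)
        pfResult[i] = pfResult[i] + pfTab[iIndex];
}

/* Second version: the value is read once into a local before the loop. */
void add_hoisted(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
    float fTab = pfTab[iIndex];
    for (int i = 0; i < iSize; i++)
        pfResult[i] = pfResult[i] + fTab;
}

/* Hypothetical self-check: pass the same array as both pfTab and pfResult. */
void alias_demo(void)
{
    float a[4] = {1.0f, 1.0f, 1.0f, 1.0f};
    float b[4] = {1.0f, 1.0f, 1.0f, 1.0f};

    add_reload(a, a, 4, 0);   /* a[0] changes at i = 0 and is re-read afterwards */
    add_hoisted(b, b, 4, 0);  /* the old value of b[0] is used throughout */

    /* The two versions disagree as soon as the pointers overlap. */
    assert(a[0] == 2.0f && a[1] == 3.0f && a[2] == 3.0f && a[3] == 3.0f);
    assert(b[0] == 2.0f && b[1] == 2.0f && b[2] == 2.0f && b[3] == 2.0f);
}
```

    Without restrict the overlapping call is perfectly legal, so the compiler has to preserve the first behaviour. With restrict on both pointers, such a call would be the caller's error.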

    The OpenMP version does not vectorise because of the way OpenMP is implemented in GCC: it uses code outlining for parallel regions, i.e. the body of the region is moved into a separate function.

    int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
    {
      float fTab =  pfTab[iIndex];
      #pragma omp parallel for
      for (int i = 0; i < iSize; i++)
         pfResult[i] = pfResult[i] + fTab;
    }
    

    effectively becomes:

    struct omp_data_s
    {
      float *pfResult;
      int iSize;
      float fTab;   // shared scalar, stored by value (it is assigned from a float below)
    };
    
    int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
    {
      float fTab =  pfTab[iIndex];
      struct omp_data_s omp_data_o;
    
      omp_data_o.pfResult = pfResult;
      omp_data_o.iSize = iSize;
      omp_data_o.fTab = fTab;
    
      GOMP_parallel_start (MyFunc_omp_fn0, &omp_data_o, 0);
      MyFunc_omp_fn0 (&omp_data_o);  // GCC's actual symbol name is MyFunc._omp_fn.0
      GOMP_parallel_end ();
      pfResult = omp_data_o.pfResult;
      iSize = omp_data_o.iSize;
      fTab = omp_data_o.fTab;
    }
    
    void MyFunc_omp_fn0 (struct omp_data_s *omp_data_i)
    {
      int start = ...; // compute starting iteration for current thread
      int end = ...; // compute ending iteration for current thread
    
      for (int i = start; i < end; i++)
        omp_data_i->pfResult[i] = omp_data_i->pfResult[i] + omp_data_i->fTab;
    }
    

    MyFunc_omp_fn0 contains the outlined function code. The compiler is not able to prove that omp_data_i->pfResult does not point to memory that aliases omp_data_i, and specifically its member fTab, so it has to assume that a store to pfResult[i] may change omp_data_i->fTab.
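
    The difference can be reproduced without OpenMP at all. The following is a hand-written sketch of the two shapes the outlined loop can take, not GCC's actual output. The names body_shared, body_private and outline_demo are hypothetical, and the structure is assumed to hold fTab by value.

```c
#include <assert.h>

/* Hypothetical stand-in for the structure GCC builds for the parallel region. */
struct omp_data_s
{
    float *pfResult;
    int iSize;
    float fTab;
};

/* Shared fTab: the member is read through the pointer inside the loop, so a
   store to pfResult[i] might, as far as the compiler knows, modify it. */
void body_shared(struct omp_data_s *omp_data_i, int start, int end)
{
    for (int i = start; i < end; i++)
        omp_data_i->pfResult[i] = omp_data_i->pfResult[i] + omp_data_i->fTab;
}

/* Private-copy fTab: read once into a local whose address is never taken,
   which is the same situation as the second version in the question. */
void body_private(struct omp_data_s *omp_data_i, int start, int end)
{
    float *pfResult = omp_data_i->pfResult;
    float fTab = omp_data_i->fTab;

    for (int i = start; i < end; i++)
        pfResult[i] = pfResult[i] + fTab;
}

/* Hypothetical self-check: without aliasing, both shapes give the same result. */
void outline_demo(void)
{
    float x[8], y[8];
    for (int i = 0; i < 8; i++)
        x[i] = y[i] = (float)i;

    struct omp_data_s dx = { x, 8, 0.5f };
    struct omp_data_s dy = { y, 8, 0.5f };
    body_shared(&dx, 0, 8);
    body_private(&dy, 0, 8);

    for (int i = 0; i < 8; i++)
        assert(x[i] == y[i]);
}
```

    Compiling both functions with -O3 and the vectoriser's verbose output should show which of the two loops GCC accepts. The exact notes depend on the GCC version.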

    In order to vectorise that loop, you have to make fTab firstprivate. This will turn it into an automatic variable in the outlined code and that will be equivalent to your second case:

    $ cat foo.c
    int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
    {
       float fTab = pfTab[iIndex];
       #pragma omp parallel for firstprivate(fTab)
       for (int i = 0; i < iSize; i++)
         pfResult[i] = pfResult[i] + fTab;
    }
    $ gcc -std=c99 -fopenmp -O3 -march=native -ftree-vectorizer-verbose=2 -c foo.c
    foo.c:6: note: LOOP VECTORIZED.
    foo.c:4: note: vectorized 1 loops in function.
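
    firstprivate only changes how fTab reaches the outlined function, so the result should match a plain serial loop. A minimal sketch of such a check follows. MyFuncPrivate and firstprivate_check are names invented here, the function returns void because nothing is returned, and the pragma is guarded so the file also builds without -fopenmp.

```c
#include <assert.h>

/* The firstprivate version from above. */
void MyFuncPrivate(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
    float fTab = pfTab[iIndex];
#ifdef _OPENMP
    #pragma omp parallel for firstprivate(fTab)
#endif
    for (int i = 0; i < iSize; i++)
        pfResult[i] = pfResult[i] + fTab;
}

/* Hypothetical check: compare the result against a plain serial loop. */
void firstprivate_check(void)
{
    enum { N = 1000 };
    static float tab[4] = {0.25f, 0.5f, 1.0f, 2.0f};
    static float par[N], ser[N];

    for (int i = 0; i < N; i++)
        par[i] = ser[i] = (float)i;

    MyFuncPrivate(tab, par, N, 1);
    for (int i = 0; i < N; i++)
        ser[i] = ser[i] + tab[1];

    for (int i = 0; i < N; i++)
        assert(par[i] == ser[i]);
}
```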