Search code examples
cperformancegccoptimizationicc

Why is this code slower when the array size is even?


Warning: Actually it is not due to power of 2 but parity. See the EDIT section.


I have found a code which shows quite a strange behaviour.

The code uses a 2D array (size x size). When size is a power of 2, the code goes between 10% and 40% slower, the slowest being for size=32.

I have obtained these results with Intel compiler. If I compile with gcc 5.4, the power of 2 problem disappears but the code is 3x slower. Having tested it on different CPUs, I think it should be reproducible enough.

Code:

#define N 10000000

unsigned int tab1[TS][TS];

void simul1() {

  for(int i=0; i<TS; i++)
  for(int j=0; j<TS; j++) {

    if(i > 0)
      tab1[i][j] += tab1[i-1][j];

    if(j > 0)
      tab1[i][j] += tab1[i][j-1];
  }
}

int main() {

  for(int i=0; i<TS; i++)
  for(int j=0; j<TS; j++)
    tab1[j][i] = 0;


  for(int i=0; i<N; i++) {
    tab1[0][0] = 1;
    simul1();
  }

  return tab1[10][10];
}

Compilation:

icc:
        icc main.c -O3 -std=c99 -Wall -DTS=31 -o i31
        icc main.c -O3 -std=c99 -Wall -DTS=32 -o i32
        icc main.c -O3 -std=c99 -Wall -DTS=33 -o i33

gcc:
        gcc main.c -O3 -std=c99 -Wall -DTS=31 -o g31
        gcc main.c -O3 -std=c99 -Wall -DTS=32 -o g32
        gcc main.c -O3 -std=c99 -Wall -DTS=33 -o g33

Results on a Xeon E5:

time ./icc31
4.549s

time ./icc32
6.557s

time ./icc33
5.188s


time ./gcc31
13.335s

time ./gcc32
13.669s

time ./gcc33
14.399s

My questions:

  • Why is there a performance loss when the size of array is exactly 32 ?
  • (Optional: Why is icc 3x faster than gcc here ?)

EDIT: Actually this is due to parity, and only applies from size >= 32. The performance difference between even and odd numbers is consistent, and decreases when size becomes bigger. Here is a more detailed benchmark :

enter image description here

(The y axis is number of elements per second in millions, obtained with TS² * N / size / 1000000)

My CPU has a 32KB L1 cache and 256 KB L2


Solution

  • Why is icc 3x faster than gcc here?

    GCC is not able to vectorize the inner loop, as it reports, that there are dependencies between data-refs. Intel's compiler is smart enough to split the inner loop into two independent parts:

    for (int j = 1; j < TS; j++)
        tab1[i][j] += tab1[i-1][j];  // this one is vectorized
    
    for (int j = 1; j < TS; j++)
        tab1[i][j] += tab1[i][j-1];
    

    Attempt 1: Loops reorganization

    You might get better performance in GCC by rewriting simul1 into:

    void simul1(void)
    {
        for (int j = 1; j < TS; j++)
            tab1[0][j] += tab1[0][j-1];
    
        for (int i = 1; i < TS; i++) {
            for (int j = 0; j < TS; j++)
                tab1[i][j] += tab1[i-1][j];
            for (int j = 1; j < TS; j++)
                tab1[i][j] += tab1[i][j-1];
        }
    }
    

    My results under GCC 6.3.0 with -O3 -march-native, TS = 32 powered by Intel Core i5 5200U are:

    Original version:

    real    0m21.110s
    user    0m21.004s
    sys 0m0.000s
    

    Modified:

    real    0m8.588s
    user    0m8.536s
    sys 0m0.000s
    

    Attempt 2: Vectorization of prefix sum

    After some reaseach, I found there is a possibility to vectorize the second inner loop by vector additions and shifts. Algorithm is presented here.

    Version A: 128-bit wide vectors

    #include "emmintrin.h"
    
    void simul1(void)
    {
        for (int j = 1; j < TS; j++)
            tab1[0][j] += tab1[0][j-1];
    
        for (int i = 1; i < TS; i++) {
            for (int j = 0; j < TS; j++)
                tab1[i][j] += tab1[i-1][j];
    
            for (int stride = 0; stride < TS; stride += 4) {
                __m128i v;
                v = _mm_loadu_si128((__m128i*) (tab1[i] + stride));
                v = _mm_add_epi32(v, _mm_slli_si128(v, sizeof(int)));
                v = _mm_add_epi32(v, _mm_slli_si128(v, 2*sizeof(int)));
                _mm_storeu_si128((__m128i*) (tab1[i] + stride), v);
            }
    
            for (int stride = 4; stride < TS; stride += 4)
                for (int j = 0; j < 4; j++)
                    tab1[i][stride+j] += tab1[i][stride-1];
        }
    }
    

    Result:

    real    0m7.541s
    user    0m7.496s
    sys 0m0.004s
    

    Version B: 256-bit wide vectors

    This one is more complicated. Consider eight-elements vector of ints:

    V = (a, b, c, d, e, f, g, h)
    

    We can treat it as two packed vectors:

    (a, b, c, d), (e, f, g, h)
    

    First, algorithm performs two independent summations:

    (a, b, c, d), (e, f, g, h)
    +
    (0, a, b, c), (0, e, f, g)
    =
    (a, a+b, b+c, c+d), (e, e+f, f+g, g+h)
    +
    (0, 0, a, a+b), (0, 0, e, e+f)
    =
    (a, a+b, a+b+c, a+b+c+d), (e, e+f, e+f+g, e+f+g+h)
    

    then it propagates last element of first vector into each element of second vector, so it finally yields:

    (a, a+b, a+b+c, a+b+c+d), (a+b+c+d+e, a+b+c+d+e+f, a+b+c+d+e+f+g, a+b+c+d+e+f+g+h)
    

    I suspect, that these intrinsics could be written better, so there is a potential for some improvement.

    #include "immintrin.h"
    
    void simul1(void)
    {
        for (int j = 1; j < TS; j++)
            tab1[0][j] += tab1[0][j-1];
    
        for (int i = 1; i < TS; i++) {
            for (int j = 0; j < TS; j++)
                tab1[i][j] += tab1[i-1][j];
    
            for (int stride = 0; stride < TS; stride += 8) {
                __m256i v;
                v = _mm256_loadu_si256((__m256i*) (tab1[i] + stride));
                v = _mm256_add_epi32(v, _mm256_slli_si256(v, sizeof(int)));
                v = _mm256_add_epi32(v, _mm256_slli_si256(v, 2*sizeof(int)));
    
                __m256i t = _mm256_setzero_si256();
                t = _mm256_insertf128_si256(t, 
                    _mm_shuffle_epi32(_mm256_castsi256_si128(v), 0xFF), 1);
    
                v = _mm256_add_epi32(v, t);
                _mm256_storeu_si256((__m256i*) (tab1[i] + stride), v);
            }
    
            for (int stride = 8; stride < TS; stride += 8)
                for (int j = 0; j < 8; j++)
                    tab1[i][stride+j] += tab1[i][stride-1];
        }
    }
    

    Result (Clang 3.8):

    real    0m5.644s
    user    0m5.364s
    sys 0m0.004s