gcc auto-vectorization

Weird GCC auto-vectorization behaviour in matrix multiply when arrays are function parameters


I'm benchmarking different forms of matrix multiplication at different optimization levels (for teaching purposes) and I noticed strange behaviour in gcc's auto-vectorization: it fails to vectorize when the arrays are function parameters (see mxmp) but vectorizes when the arrays are global variables (see mxmg).

gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1), but the behaviour was the same with older gcc versions.

Compiling options: gcc -O3 -mavx2 -mfma

#define N 1024
float A[N][N], B[N][N], C[N][N];

void mxmp(float A[N][N], float B[N][N], float C[N][N]) {
  int i,j,k;
  for (i=0; i<N; i++)
    for (j=0; j<N; j++)
      for (k=0; k<N; k++)
        C[i][j] = C[i][j] + A[i][k] * B[k][j];
}

void mxmg() {
  int i,j,k;
  for (i=0; i<N; i++)
    for (j=0; j<N; j++)
      for (k=0; k<N; k++)
        C[i][j] = C[i][j] + A[i][k] * B[k][j];
}

int main(void) {
  mxmg();
  mxmp(A, B, C);
  return 0;
}

I expected the compiler to treat both functions the same way; however, mxmp takes about 10 times as long to run as mxmg. Looking at the assembly code, gcc auto-vectorizes mxmg (where the arrays are global variables) but fails to vectorize mxmp (where the arrays are parameters).

I tried the same with the kij form (k as the outermost loop, j as the innermost), and gcc is able to vectorize both functions.
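For reference, the kij ordering could be sketched as below. The name mxmp_kij is made up for this illustration, and N is reduced from the question's 1024 only so the sketch runs quickly. The loop body is unchanged; only the nesting order differs, so the innermost loop walks C[i][j] and B[k][j] with unit stride.

```c
#define N 256  /* the question uses 1024; smaller here so the sketch runs quickly */

/* kij ordering: k outermost, j innermost (hypothetical name mxmp_kij).
   In the innermost loop A[i][k] does not depend on j, while C[i][j]
   and B[k][j] are both read along a row. */
void mxmp_kij(float A[N][N], float B[N][N], float C[N][N]) {
  int i, j, k;
  for (k = 0; k < N; k++)
    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
        C[i][j] = C[i][j] + A[i][k] * B[k][j];
}
```

Like the ijk versions, this accumulates into C, so C is assumed to start zeroed (or to hold a value you want to add to).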

I need help understanding why gcc behaves this way, and how to help gcc (pragmas, compile options, attributes, ...) properly vectorize the mxmp function. Thanks.


Solution

  • When the arrays are global, the compiler can easily see that they are disjoint memory regions. When they are function parameters, a caller could write mxmp(A,A,A), so the compiler has to assume that writing to C may modify A or B, which could affect later iterations and complicates vectorization. Of course, the compiler could inline the call or use other means to establish this in your particular case...

    You can explicitly specify the lack of aliasing with restrict:

    void mxmp(float A[restrict N][N], float B[restrict N][N], float C[restrict N][N]) {
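Spelled out in full, the restrict-qualified function might look like the sketch below. The name mxmp_restrict is invented here to keep it apart from the original mxmp, and N is reduced from 1024 only to keep the example quick to run. The syntax is C99 and is not valid C++.

```c
#define N 256  /* the question uses 1024; smaller here so the sketch runs quickly */

/* restrict inside the outermost [] qualifies the pointer that the array
   parameter decays to, i.e. float (*restrict A)[N]. The caller promises
   that the memory written through C is not also accessed through A or B
   during the call. (mxmp_restrict is a hypothetical name.) */
void mxmp_restrict(float A[restrict N][N], float B[restrict N][N],
                   float C[restrict N][N]) {
  int i, j, k;
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      for (k = 0; k < N; k++)
        C[i][j] = C[i][j] + A[i][k] * B[k][j];
}
```

With this signature a call such as mxmp_restrict(A, A, A) breaks the promise and has undefined behaviour, so it only suits callers that always pass distinct arrays. One way to check whether the loop was actually vectorized is to add -fopt-info-vec (or -fopt-info-vec-missed) to the compile options and read gcc's report; whether it vectorizes may still depend on the gcc version.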