I am trying to auto-vectorize the following loop nest. The i- and j-loops run over the lower triangle of a matrix. Unfortunately, the vectorization report does not show the j- and k-loops as vectorized (i.e. translated to AVX SIMD instructions). I think vectorization should be straightforward, because there are no pointer aliases (#pragma ivdep and the compiler option -D NOALIAS) and the data (x and p, both 1D arrays) is aligned to 64 bytes.
The if-statement might be the problem, but even with the if-free variant (an expensive shift operation that extracts the sign of a double) the compiler does not report this loop as vectorized.
__assume_aligned(x, 64);
__assume_aligned(p, 64);
#pragma omp parallel for simd reduction(+:accum)
for ( int i = 1 ; i < N ; i++ ){ // loop over lower triangle (i,j), OpenMP SIMD LOOP WAS VECTORIZED
    for ( int j = 0 ; j < i ; j++ ){ // <-- remark #25460: No loop optimizations reported
        double __attribute__((aligned(64))) scalarp = 0.0;
        #pragma omp simd
        for ( int k = 0 ; k < D ; k++ ){ // <-- remark #25460: No loop optimizations reported
            // scalar product of \sum_k x_{i,k} \cdot x_{j,k}
            scalarp += x[i*D + k] * x[j*D + k];
        }
        // Alternative to following if:
        // accum += - ( (long long) floor( - ( scalarp + p[i] + p[j] ) ) >> 63);
        #pragma ivdep
        if ( scalarp + p[i] + p[j] >= 0 ){ // check if condition is satisfied
            accum += 1;
        }
    }
}
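For reference, a simpler branch-free form than the floor/shift expression is to add the result of the comparison itself, since a relational expression in C evaluates to the int 0 or 1. The following is only a sketch: the function name count_pairs and its signature are made up for illustration, the OpenMP parallelization and alignment hints are left out, and whether a given compiler vectorizes it still has to be checked. Note that it counts a sum of exactly zero, like the if-version, whereas the floor/shift expression appears to yield 0 in that case.

```c
/* Hypothetical helper (not part of the original code): counts the pairs
   (i, j) with j < i for which dot(x_i, x_j) + p[i] + p[j] >= 0.
   x holds n rows of d doubles each, p holds n doubles. */
long count_pairs(const double *x, const double *p, int n, int d)
{
    long accum = 0;
    for (int i = 1; i < n; i++) {          /* rows of the lower triangle */
        for (int j = 0; j < i; j++) {      /* columns left of the diagonal */
            double scalarp = 0.0;
            /* The reduction clause tells the compiler that scalarp is a sum. */
            #pragma omp simd reduction(+:scalarp)
            for (int k = 0; k < d; k++) {
                scalarp += x[i * d + k] * x[j * d + k];
            }
            /* The comparison evaluates to 0 or 1, so no branch is needed. */
            accum += (scalarp + p[i] + p[j] >= 0.0);
        }
    }
    return accum;
}
```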
Is this related to the fact that the starting index of each OpenMP thread is not known until run time? I thought the simd clause resolves this and that Intel's auto-vectorizer is aware of it.
Intel Compiler: 18.0.2 20180210
Edit: I've looked at the assembly and it is now clear that the code is already vectorized. Sorry for bothering all of you.
Looking at the assembly really helps: the code is already vectorized. In this particular case, the remark OpenMP SIMD LOOP WAS VECTORIZED also covers the inner loops.
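For anyone who wants to repeat the assembly check on something smaller, here is a minimal sketch of the inner scalar product as a standalone function. The name dot and its signature are assumptions for illustration, not part of the code above.

```c
/* Hypothetical standalone kernel: scalar product of two vectors of length d. */
double dot(const double *a, const double *b, int d)
{
    double sum = 0.0;
    /* Ask for a vectorized sum reduction; the pragma is ignored if OpenMP
       SIMD support is not enabled at compile time. */
    #pragma omp simd reduction(+:sum)
    for (int k = 0; k < d; k++) {
        sum += a[k] * b[k];
    }
    return sum;
}
```

Compiling this with -S (accepted by GCC, Clang and the Intel compiler) together with the usual optimization and OpenMP options produces an assembly listing. Packed instructions such as vmulpd or vfmadd231pd on ymm/zmm registers indicate a vectorized loop, while only scalar forms such as vmulsd suggest it was not vectorized.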