I have the following 4x4 matrix-vector multiply code:
double const __restrict__ a[16];
double const __restrict__ x[4];
double __restrict__ y[4];
// #pragma GCC unroll 1 - does not work either
#pragma GCC nounroll
for ( int j = 0; j < 4; ++j )
{
    double const* __restrict__ aj = a + j * 4;
    double const xj = x[j];
    #pragma GCC ivdep
    for ( int i = 0; i < 4; ++i )
    {
        y[i] += aj[i] * xj;
    }
}
I compile with the -O3 -mavx flags. The inner loop is vectorized (single FMAD). However, gcc (7.2) keeps unrolling the outer loop 4 times unless I use -O2 or a lower optimization level.
Is there a way to override the -O3 unrolling of a particular loop?
NB. A similar #pragma nounroll works if I use Intel icc.
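According to the documentation, #pragma GCC unroll 1 is supposed to work, provided it is placed immediately before the loop it applies to (or before a #pragma GCC ivdep that precedes that loop). Note that this pragma was only added in GCC 8, so gcc 7.2 will not recognize it. If it has no effect on a version that documents it, you should submit a bug report.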
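For illustration, here is a minimal sketch of where the pragma would go for the loop in the question. The function name matvec4_pragma and the pointer-parameter signature are my own assumptions, not part of the original code, and whether the outer loop really stays rolled should be checked in the generated assembly. Compilers that do not know the pragma ignore it, usually with a warning.

```cpp
// Hypothetical wrapper around the loop from the question.
// a holds 16 values; column j starts at a + j * 4.
void matvec4_pragma(double const* __restrict__ a,
                    double const* __restrict__ x,
                    double* __restrict__ y)
{
    // Ask GCC (8 or later) not to unroll the outer loop.
    // The pragma sits directly before the for statement.
    #pragma GCC unroll 1
    for (int j = 0; j < 4; ++j)
    {
        double const* __restrict__ aj = a + j * 4;
        double const xj = x[j];
        // The inner loop is left to the vectorizer.
        for (int i = 0; i < 4; ++i)
        {
            y[i] += aj[i] * xj;
        }
    }
}
```

Compiling with -O3 -mavx -S and counting the multiply instructions in the output is a quick way to see whether the pragma took effect.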
Alternatively, I think you can use a function attribute to set optimization options for a single function:
void myfn () __attribute__((optimize("no-unroll-loops")));
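Applied to the code in the question, that could look roughly like the sketch below. The names matvec4_attr and NO_UNROLL are made up for this example, and the macro guard is only there so that compilers other than GCC skip the attribute. I have not verified that this option actually stops the -O3 unrolling of a 4-iteration loop, so treat that as an assumption and inspect the assembly.

```cpp
// GCC-specific attribute; other compilers get an empty macro.
// NO_UNROLL is a hypothetical helper name used only in this sketch.
#if defined(__GNUC__) && !defined(__clang__)
#define NO_UNROLL __attribute__((optimize("no-unroll-loops")))
#else
#define NO_UNROLL
#endif

// The attribute changes optimization options for this function only.
NO_UNROLL
void matvec4_attr(double const* __restrict__ a,
                  double const* __restrict__ x,
                  double* __restrict__ y)
{
    for (int j = 0; j < 4; ++j)
    {
        double const* __restrict__ aj = a + j * 4;
        double const xj = x[j];
        for (int i = 0; i < 4; ++i)
        {
            y[i] += aj[i] * xj;
        }
    }
}
```

Keeping the kernel in its own small function means the changed options do not affect the rest of the translation unit.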