I would like GCC to vectorize the below code.-fopt-info
tells me that GCC is not currently. I believe the problem is the strided access of W
or possible the backward incrementing of k
. Note that height
and width
are constants and index_type
is set to unsigned long
currently.
I removed some comments
114 for (index_type k=height-1;k+1>0;k--) {
116 for (index_type i=0;i<width;i++) {
117 Yp[k*width + i] = 0.0;
119 for (index_type j=0;j<width;j++) {
121 Yp[k*width + i] += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
122 }
123 Yp[k*width + i] *= DF(Ap[k*width + i]);
124 }
125 }
I am compiling with gcc -O3 -ffast-math -fopt-info -std=c11 ./neural.c -o neural -lm
Is there a good way to make this vectorize? Can you refer me to further information?
Is my method for indexing a bad idea (ie the k*width*width + ...
)? I need to dynamically allocate, and I thought that keeping things close in memory would better enable optimizations.
EDIT: This might be useful
The -fopt-info-missed
output for these lines
./neural.c:114:3: note: not vectorized: multiple nested loops.
./neural.c:114:3: note: bad loop form.
./neural.c:116:5: note: not vectorized: control flow in loop.
./neural.c:116:5: note: bad loop form.
./neural.c:119:7: note: step unknown.
./neural.c:119:7: note: reduction used in loop.
./neural.c:119:7: note: Unknown def-use cycle pattern.
./neural.c:119:7: note: not vectorized: complicated access pattern.
./neural.c:119:7: note: bad data access.
./neural.c:110:21: note: not vectorized: not enough data-refs in basic block.
./neural.c:110:58: note: not vectorized: not enough data-refs in basic block.
./neural.c:110:62: note: not vectorized: not enough data-refs in basic block.
./neural.c:117:18: note: not vectorized: not enough data-refs in basic block.
./neural.c:114:37: note: not vectorized: not enough data-refs in basic block.
EDIT:
Minimal example is HERE
I am trying it with BLAS. In the minimal example, it goes faster, but on the whole code it is slower ... not sure why
EDIT:
Compiler was optimizing out code. Fixed. BLAS is now faster. The fix was on the whole code, not the minimal example.
EDIT:
Same code as in the link from the previous edit
#include <math.h>
#include <cblas.h>
#include <stdlib.h>
#include <stdio.h>
typedef float value_type;
typedef unsigned long index_type;
static value_type F(value_type v) {
return 1.0/(1.0 + exp(-v));
}
static value_type DF(value_type v) {
const value_type Ev = exp(-v);
return Ev/((1.0 + Ev)*(1.0 + Ev));
}
#ifndef WITH_BLAS
static void get_Yp(const value_type * __restrict__ Ap, const value_type * __restrict__ W,
value_type * __restrict__ Yp, const value_type * __restrict__ Dp,
const index_type height, const index_type width) {
for (index_type i=0;i<width;i++) {
Yp[height*width + i] = 2*DF(Ap[height*width + i])*(Dp[i] - F(Ap[height*width + i]));
}
for (index_type k=height-1;k+1>0;k--) {
for (index_type i=0;i<width;i++) {
Yp[k*width + i] = 0.0;
for (index_type j=0;j<width;j++) {
Yp[k*width + i] += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
}
Yp[k*width + i] *= DF(Ap[k*width + i]);
}
}
}
#else
static void get_Yp(const value_type * __restrict__ Ap, const value_type * __restrict__ W,
value_type * __restrict__ Yp, const value_type * __restrict__ Dp,
const index_type height, const index_type width) {
for (index_type i=0;i<width;i++) {
Yp[height*width + i] = 2*DF(Ap[height*width + i])*(Dp[i] - F(Ap[height*width + i]));
}
for (index_type k=height-1;k+1>0;k--) {
cblas_sgemv(CblasRowMajor, CblasTrans, width, width, 1,
W+k*width*width, width, Yp+(k+1)*width, 1, 0, Yp+k*width, 1);
for (index_type i=0;i<width;i++)
Yp[k*width + i] *= DF(Ap[k*width + i]);
}
}
#endif
int main() {
const index_type height=10, width=10000;
value_type *Ap=malloc((height+1)*width*sizeof(value_type)),
*W=malloc(height*width*width*sizeof(value_type)),
*Yp=malloc((height+1)*width*sizeof(value_type)),
*Dp=malloc(width*sizeof(value_type));
get_Yp(Ap, W, Yp, Dp, height, width);
printf("Done %f\n", Yp[3]);
return 0;
}
j-loop is well vectorizable SIMD reduction loop with constant stride of "width"-elements. You can vectorize it using modern compilers. This code is vectorizable with Intel Compiler and should be vectorizable by GCC under some conditions.
First of all, Reduction is a special case of "vectorizable" true loop-carried dependency. So you can't safely vectorize it unless "reduction" pattern is (a) auto-recognized by Compiler (not so easy and strictly speaking not so valid/expected behavior) or (b) communicated by developer to compiler explicitely using OpenMP or similar standards.
To "communicate" to compiler that there is a reduction - you need to use
#pragma omp simd reduction (+ : variable_name)
before j-loop.
This is only supported starting from OpenMP4.0. So you have to use GCC version which supports OpenMP4.x. Quote from https://gcc.gnu.org/wiki/openmp: "GCC 4.9 supports OpenMP 4.0 for C/C++, GCC 4.9.1 also for Fortran"
I would also use temporarily local variable for accumulating reduction ( OpenMP4.0 requires reduction variable to be used that way):
tmpSUM = 0;
#pragma omp simd reduction (+: tmpSUM)
for (index_type j=0;j<width;j++) {
tmpSUM += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
}
Yp[k*width + i] = tmpSUM
I'd also suggest to use signed int instead of unsigned, because unsinged induction variables are pretty bad for all modern vectorizers, at least by introducing extra overhead. I would not be surpised if using unsigned was one of the main reasons, "confused" GCC.
Now, you could be not-satisfied with my reply, because it tells about how it should work (and how it works in ICC/ICPC). It doesn't take into account specific nuances of GCC (which for reduction seemed to do oppostie), as one can see in GCC optimization report.
So, if you are still limited to GCC, I would suggest:
If none above help, take a look at "weird" things like described here (very similar to your case): What do gcc's auto-vectorization messages mean? or consider using other compiler/compiler version.