
GCC Hinting at Vectorization


I would like GCC to vectorize the code below. -fopt-info tells me that it is not currently being vectorized. I believe the problem is the strided access of W, or possibly the backward incrementing of k. Note that height and width are compile-time constants and index_type is currently set to unsigned long.

I removed some comments

114  for (index_type k=height-1;k+1>0;k--) {
116    for (index_type i=0;i<width;i++) {
117      Yp[k*width + i] = 0.0;                                                            
119      for (index_type j=0;j<width;j++) {                                                                            
121        Yp[k*width + i] += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
122      }
123      Yp[k*width + i] *= DF(Ap[k*width + i]);
124    }
125  }

I am compiling with gcc -O3 -ffast-math -fopt-info -std=c11 ./neural.c -o neural -lm

Is there a good way to make this vectorize? Can you refer me to further information?

Is my indexing scheme a bad idea (i.e., the k*width*width + ...)? I need to allocate dynamically, and I thought that keeping things close in memory would better enable optimizations.

EDIT: This might be useful

The -fopt-info-missed output for these lines

./neural.c:114:3: note: not vectorized: multiple nested loops.
./neural.c:114:3: note: bad loop form.
./neural.c:116:5: note: not vectorized: control flow in loop.
./neural.c:116:5: note: bad loop form.
./neural.c:119:7: note: step unknown.
./neural.c:119:7: note: reduction used in loop.
./neural.c:119:7: note: Unknown def-use cycle pattern.
./neural.c:119:7: note: not vectorized: complicated access pattern.
./neural.c:119:7: note: bad data access.
./neural.c:110:21: note: not vectorized: not enough data-refs in basic block.
./neural.c:110:58: note: not vectorized: not enough data-refs in basic block.
./neural.c:110:62: note: not vectorized: not enough data-refs in basic block.
./neural.c:117:18: note: not vectorized: not enough data-refs in basic block.
./neural.c:114:37: note: not vectorized: not enough data-refs in basic block.

EDIT:

Minimal example is HERE

I am trying it with BLAS. On the minimal example it is faster, but on the whole code it is slower ... I am not sure why.

EDIT:

Compiler was optimizing out code. Fixed. BLAS is now faster. The fix was on the whole code, not the minimal example.

EDIT:

Same code as in the link from the previous edit

#include <math.h>
#include <cblas.h>
#include <stdlib.h>
#include <stdio.h>

typedef float value_type;
typedef unsigned long index_type;

static value_type F(value_type v) {
  return 1.0/(1.0 + exp(-v));
}

static value_type DF(value_type v) {
  const value_type Ev = exp(-v);
  return Ev/((1.0 + Ev)*(1.0 + Ev));
}

#ifndef WITH_BLAS

static void get_Yp(const value_type * __restrict__ Ap, const value_type * __restrict__ W,
           value_type * __restrict__ Yp, const value_type * __restrict__ Dp,
           const index_type height, const index_type width) {
  for (index_type i=0;i<width;i++) {
    Yp[height*width + i] = 2*DF(Ap[height*width + i])*(Dp[i] - F(Ap[height*width + i]));
  }

  for (index_type k=height-1;k+1>0;k--) {
    for (index_type i=0;i<width;i++) {
      Yp[k*width + i] = 0.0;
    for (index_type j=0;j<width;j++) {
      Yp[k*width + i] += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
    }
      Yp[k*width + i] *= DF(Ap[k*width + i]);
    }
  }
}

#else

static void get_Yp(const value_type * __restrict__ Ap, const value_type * __restrict__ W,
           value_type * __restrict__ Yp, const value_type * __restrict__ Dp,
           const index_type height, const index_type width) {
  for (index_type i=0;i<width;i++) {
    Yp[height*width + i] = 2*DF(Ap[height*width + i])*(Dp[i] - F(Ap[height*width + i]));
  }

  for (index_type k=height-1;k+1>0;k--) {
    cblas_sgemv(CblasRowMajor, CblasTrans, width, width, 1,
        W+k*width*width, width, Yp+(k+1)*width, 1, 0, Yp+k*width, 1);
    for (index_type i=0;i<width;i++)
      Yp[k*width + i] *= DF(Ap[k*width + i]);
  }
}

#endif

int main() {
  const index_type height=10, width=10000;

  value_type *Ap=malloc((height+1)*width*sizeof(value_type)),
    *W=malloc(height*width*width*sizeof(value_type)),
    *Yp=malloc((height+1)*width*sizeof(value_type)),
    *Dp=malloc(width*sizeof(value_type));

  get_Yp(Ap, W, Yp, Dp, height, width);
  printf("Done %f\n", Yp[3]);

  return 0;
}

Solution

    1. The j-loop is a well-vectorizable SIMD reduction loop with a constant stride of width elements. You can vectorize it using modern compilers. This code is vectorizable with the Intel Compiler, and should be vectorizable by GCC under some conditions.

    2. First of all, a reduction is a special case of a "vectorizable" true loop-carried dependency. So you can't safely vectorize it unless the reduction pattern is either (a) auto-recognized by the compiler (not so easy, and strictly speaking not guaranteed behavior) or (b) communicated by the developer to the compiler explicitly using OpenMP or a similar standard.

    To "communicate" to the compiler that there is a reduction, use #pragma omp simd reduction (+ : variable_name) before the j-loop.

    This is only supported starting from OpenMP 4.0, so you have to use a GCC version which supports OpenMP 4.x. Quote from https://gcc.gnu.org/wiki/openmp: "GCC 4.9 supports OpenMP 4.0 for C/C++, GCC 4.9.1 also for Fortran". Also note that the pragma only takes effect when compiling with -fopenmp or -fopenmp-simd.

    I would also use a temporary local variable for accumulating the reduction (OpenMP 4.0 requires the reduction variable to be used that way):

     tmpSUM = 0;
     #pragma omp simd reduction (+: tmpSUM)
     for (index_type j=0;j<width;j++) {
       tmpSUM += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
     }
     Yp[k*width + i] = tmpSUM;
    
    3. I'd also suggest using signed int instead of unsigned, because unsigned induction variables are pretty bad for all modern vectorizers, at least in that they introduce extra overhead. I would not be surprised if using unsigned was one of the main things that "confused" GCC.

    4. Now, you may not be satisfied with my reply, because it describes how it should work (and how it works in ICC/ICPC). It doesn't take into account the specific nuances of GCC (which, for the reduction, seemed to do the opposite), as one can see in the GCC optimization report.

    So, if you are still limited to GCC, I would suggest:

    • Make sure it's a fresh enough GCC (4.9 or higher)
    • Use a signed induction variable and still try omp simd reduction on a temporary local tmpSUM (because it should enable more advanced vectorization techniques anyway)
    • If none of the above helps, take a look at "weird" things like those described here (very similar to your case): What do gcc's auto-vectorization messages mean?, or consider using another compiler/compiler version.

    5. Last comment: is the access pattern in your code, and constant stride more generally, really so bad? Answer: on some platforms/compilers constant stride will not kill your performance. But still, ideally you need a more cache-friendly algorithm. Check Not sure how to explain some of the performance results of my parallelized matrix multiplication code. One more option is considering MKL or BLAS (as suggested by the other reply) if your code is really memory-bound and you don't have time to deal with memory optimizations yourself.