I am trying to implement the Hogwild! Linear SVM algorithm, but I am running into false sharing problems with my implementation.
My code is below, but the background is that I am trying to compute which samples fail my test and then make an update given by that set of vectors. Hogwild! (as far as I understand) simply makes the updates on the same memory completely asynchronously. This creates "noise" in a mathematical sense due to the improperly timed updates.
Sadly, as I try to do these async updates, the L1 cache is invalidated and has to be re-fetched. Below is my code.
Is there a good way to fix this false sharing without losing the asynchrony? (I am more of a mathematician than a computer scientist.) This mentions that using different optimization levels can fix this.
void update(size_t epoch, const double *X_data, const int *X_indices,
            const int *X_indptr, const int *Y, double *W,
            double reg, double step_size, size_t nodes,
            size_t X_height, size_t X_width) {
    size_t i, j;
    double step = step_size/(1 + epoch);
    double c;
    #pragma omp parallel shared(W, X_data, X_indices, X_indptr, Y) private(i, j, c)
    {
        #pragma for schedule(static)
        for (i=0;i<X_height;i++) {
            c = 0.0;
            for (j=X_indptr[i];j<X_indptr[i+1];j++)
                c += X_data[j]*W[X_indices[j]]; // Scaled to discount the MPI scaling
            if (Y[i]*c > 1)
                continue;
            for (j=X_indptr[i];j<X_indptr[i+1];j++)
                W[X_indices[j]] += step*Y[i]*X_data[j]/(X_height*nodes);
        } // END FOR OMP PARALLELIZED
        #pragma for schedule(static) // Might not do much
        for (i=0;i<X_width;i++) // (1 - self.reg*step)*self.W/self.nodes +
            W[i] *= (1 - reg*step)/nodes;
    }
}
I don't know much about the algorithm you mention, but it looks to me that it is globally much more memory-bound than compute-bound. To convince you, here is a quick rewrite of your code:
void update( size_t epoch, const double *X_data, const int *X_indices,
             const int *X_indptr, const int *Y, double *W,
             double reg, double step_size, size_t nodes,
             size_t X_height, size_t X_width ) {
    const double step = step_size / ( 1 + epoch );
    const double ratio = step / ( X_height * nodes );
    const double tapper = ( 1 - reg * step ) / nodes;
    #pragma omp parallel
    {
        #pragma omp for schedule( static )
        for ( size_t i = 0; i < X_height; i++ ) {
            double c = 0;
            for ( int j = X_indptr[i]; j < X_indptr[i+1]; j++ ) {
                c += X_data[j] * W[X_indices[j]]; // Scaled to discount the MPI scaling
            }
            if ( Y[i] * c <= 1 ) {
                double ratioYi = Y[i] * ratio;
                for ( int j = X_indptr[i]; j < X_indptr[i+1]; j++ ) {
                    // ATTENTION: this will collide across threads and have undefined result BY DESIGN
                    W[X_indices[j]] += ratioYi * X_data[j];
                }
            }
        } // END FOR OMP PARALLELIZED
        #pragma omp for schedule( static ) // Might not do much
        for ( size_t i = 0; i < X_width; i++ ) { // (1 - self.reg*step)*self.W/self.nodes +
            W[i] *= tapper;
        }
    }
}
As you can see, I made a few changes. Most of them are purely stylistic (indentation, spacing, variable declaration location, etc.), but some are truly important. For example, by defining ratio and ratioYi at the shallowest loop level possible, I remove (or help the compiler to remove, if it would have done so) most of the computations from the code. It then becomes obvious that the code almost only accesses data and computes very little.
Therefore, unless you have a multi-socket (or multi memory controller) shared memory machine, you won't see much speed-up (if any) out of this OpenMP parallelisation.
Moreover, the "by design" race conditions the algorithm accepts while updating W
in parallel, even if justified in the paper you pointed, keep on puzzling me. I still wouldn't like to rely on an undefined behaviour for a computational code (or any code for that matter).
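For what it's worth, if you ever wanted those scattered updates to be well-defined, a minimal sketch would be to make each write to W atomic (at the cost of giving up some of the lock-free behaviour Hogwild! is built on), e.g.:

// Hypothetical variant of the inner update loop: each scattered write to W
// is made atomic, so concurrent threads no longer produce undefined results.
// This deliberately trades away part of the lock-free Hogwild! approach.
for ( int j = X_indptr[i]; j < X_indptr[i+1]; j++ ) {
    #pragma omp atomic update
    W[X_indices[j]] += ratioYi * X_data[j];
}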
Anyway, assuming the code does what you want, scales, and is indeed only limited by L1 cache invalidations due to false sharing (or indeed true sharing here, since you authorise data collisions), a possible "solution" would be to increase the size of your W array, for example by doubling it, and to store meaningful data only at every second index. In your algorithm this wouldn't change anything; you would simply have to multiply your X_indices by 2. By doing that, you would further reduce the likelihood of false sharing by mechanically halving the number of useful values stored in a single cache line. However, again, for a memory-bound code, increasing the memory consumption might not be the best idea ever... But since it is a very straightforward test, just try it and see whether it gives you any benefit. A sketch of the padding idea follows.
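To make the suggestion concrete, here is what I have in mind (the helper name pad_weights and the stride of 2 are purely illustrative, not part of your code):

#include <stdlib.h>

/* Illustrative helper: build a padded copy of W where useful values sit only
 * at every second index, so each cache line holds half as many live entries
 * and collisions between threads become less likely. */
double *pad_weights( const double *W, size_t X_width ) {
    double *padW = calloc( 2 * X_width, sizeof *padW );
    for ( size_t i = 0; i < X_width; i++ )
        padW[2*i] = W[i];
    return padW;
}
/* In the update loops, W[X_indices[j]] then becomes padW[2*X_indices[j]],
 * i.e. the X_indices values are effectively multiplied by 2. */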
A final note to say that your code had a bug in the OpenMP parallelisation: you had #pragma for instead of #pragma omp for. Not sure if this was a typo while copying here, but better to mention it just in case.