I am working on parallelizing my particle-in-cell code, which I use to simulate deformation within the Earth in 2D and 3D. Several routines of the code are easy to parallelize using OpenMP and scale very well. However, I have run into problems in a crucial part of the code that deals with interpolation from particles to grid cells. The particles move around each iteration (according to a velocity field), while many calculations are most efficient to perform on a regular, non-deforming grid. Therefore, each iteration involves communication from "randomly" distributed particles to the grid cells.
The problem can be illustrated in the following simplified 1D code:
//EXPLANATION OF VARIABLES (all previously allocated and initialized, 1D arrays)
//double *markerval; // Size Nm. Particle values, to be interpolated to the grid.
//double *grid;      // Size Ng = Nm/100. Grid values.
//uint *markerpos;   // Size Nm. Position of particles relative to the grid (each particle
//                   // knows which grid cell it belongs to); possible values are 0, 1, ..., Ng-1.

//#pragma omp parallel for schedule(static) private(e)
for (e = 0; e < Nm; e++) {
    //#pragma omp atomic
    grid[markerpos[e]] += markerval[e];
}
The particle positions are random in the worst case, but typically particles that neighbor each other in memory also neighbor each other in space, and hence in grid memory.
How do I parallelize this procedure efficiently? Several particles map to the same grid cell, so directly parallelizing the loop above leads to race conditions and cache-line thrashing on the shared grid array. Making the update atomic prevents the race conditions, but it makes the code much slower than the sequential version.
I also tried making a private copy of the grid values for each thread and adding them up afterwards. However, this probably requires too much memory to be practical, and in this example it did not scale well with the number of threads (for reasons I am uncertain about).
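For reference, the per-thread-copy idea can be expressed compactly with an OpenMP 4.5 array-section reduction, assuming the compiler supports it (this is only a sketch of that approach; the runtime allocates and sums one private copy of grid per thread, which is exactly where the O(Ng * nthreads) memory cost comes from):

/* Sketch: per-thread copies of grid via an OpenMP >= 4.5 array-section
 * reduction. Each thread accumulates into its own private copy of
 * grid[0..Ng-1]; the copies are summed when the loop ends. */
#pragma omp parallel for schedule(static) reduction(+:grid[0:Ng])
for (int e = 0; e < Nm; e++) {
    grid[markerpos[e]] += markerval[e];
}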
A third option might be to map from the grid to the particles and then loop over grid indices instead of particle indices. However, I fear this would be quite involved and require a major change to the code, and I am uncertain how much it would help, since it would also require sorting routines that are computationally expensive themselves.
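Roughly, what I have in mind is something like the following (just a sketch; the counting sort and the helper arrays cellstart, order and fill are illustrative names only, need <stdlib.h>, and could of course be allocated once and reused every iteration):

/* Sketch: bucket the particles by grid cell with a counting sort,
 * then let each thread sum the particles of its own cells without atomics. */
int *cellstart = calloc(Ng + 1, sizeof(int)); /* start offset of each cell's bucket */
int *order     = malloc(Nm * sizeof(int));    /* particle indices grouped by cell   */
int *fill      = calloc(Ng, sizeof(int));     /* scratch: particles placed per cell */

for (int e = 0; e < Nm; e++)                  /* histogram of cell occupancy */
    cellstart[markerpos[e] + 1]++;
for (int c = 0; c < Ng; c++)                  /* prefix sum -> bucket offsets */
    cellstart[c + 1] += cellstart[c];
for (int e = 0; e < Nm; e++) {                /* scatter particle indices into buckets */
    int c = markerpos[e];
    order[cellstart[c] + fill[c]++] = e;
}

#pragma omp parallel for schedule(static)
for (int c = 0; c < Ng; c++) {                /* each cell is written by exactly one thread */
    double sum = 0.0;
    for (int i = cellstart[c]; i < cellstart[c + 1]; i++)
        sum += markerval[order[i]];
    grid[c] += sum;
}

free(fill); free(order); free(cellstart);     /* or keep the buffers for the next iteration */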
Has anybody got any experience with this or a similar problem?
An option could be to map the iterations manually onto the threads:
#pragma omp parallel shared(Nm, Ng, markerval, markerpos, grid)
{
    int nthreads = omp_get_num_threads();
    int rank     = omp_get_thread_num();
    int factor   = Ng/nthreads;

    for (int e = 0; e < Nm; e++) {
        int pos = markerpos[e];
        if ( (pos/factor)%nthreads == rank )   /* only the owning thread updates this cell */
            grid[pos] += markerval[e];
    }
}
A few remarks:

- The iterations of the for loop are not shared among the threads; instead, each thread performs all of the iterations.
- The if statement inside the for loop decides which thread will update location pos of the grid array. Each location belongs to one thread only, so the atomic protection is not needed.
- The expression (pos/factor)%nthreads is just a simple heuristic. Any function of pos that returns a value in the range 0,...,nthreads-1 could in fact be substituted for this expression without compromising the validity of the final results (so feel free to change it if you have a better shot). Note that a poor choice of this function may result in load-balancing issues.
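If locality matters, one possible variation (again only a sketch) is to give each thread a single contiguous block of grid indices, so that the cells a thread writes to are packed together in memory; the ceiling division ensures pos/blocksize stays in 0,...,nthreads-1 even when Ng is not a multiple of nthreads:

#pragma omp parallel shared(Nm, Ng, markerval, markerpos, grid)
{
    int nthreads  = omp_get_num_threads();
    int rank      = omp_get_thread_num();
    /* block size, rounded up so that every pos in 0..Ng-1 maps to a valid rank */
    int blocksize = (Ng + nthreads - 1) / nthreads;

    for (int e = 0; e < Nm; e++) {
        int pos = markerpos[e];
        if (pos / blocksize == rank)       /* one contiguous block of cells per thread */
            grid[pos] += markerval[e];
    }
}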