Tags: c++, openmp, eigen, eigen3

Using omp atomic operation in OpenMP for class members with Eigen Array


I am doing simulations of N particles that interact with each other within a given range. To avoid an N^2 computation over all particle pairs, I sort the particles spatially into boxes: an array stores, for each box, the index of one particle, and each particle then points to the next particle in the same box. I have already written a sequential code in C++ and I am trying to implement an OpenMP version to increase the number of particles. To define the particles and the array I use two classes:



class Boxes
{
    int m_NX;
    int m_NY;
    int m_Nboxes;
    Eigen::ArrayXi m_boxes;
...
};

class Particles
{
        int m_nbParticles;
        Eigen::ArrayXd m_positions;
        Eigen::ArrayXi m_nextParticles;
...
};

Then, to sort the particles, I do this:

void updateBoxes(Boxes &p_boxes, Particles &p_particles)
{
...
    for (int i = 0; i < p_particles.m_nbParticles; i++)
    {
        indexX = p_particles.position(i).x() / dX;
        indexY = p_particles.position(i).y() / dX;
        indexBox = indexX + NXboxes*indexY;
        p_particles.m_nextParticles[i] = p_boxes.m_boxes[indexBox];
        p_boxes.m_boxes[indexBox] = i;
    }
}

I tried to parallelize this part by adding #pragma omp atomic:

    #pragma omp parallel for
    for (int i = 0; i < p_particles.size(); i++)
    {
        indexX = p_particles.position(i).x() / dX;
        indexY = p_particles.position(i).y() / dX;
        indexBox = indexX + NXboxes*indexY;
        #pragma omp atomic
        p_particles.m_nextParticles[i] = p_boxes.m_boxes[indexBox];
        #pragma omp atomic
        p_boxes.m_boxes[indexBox] = i;

    }

But it doesn't compile; I get the following error:

error: the statement for 'atomic' must be an expression statement of form '++x;', '--x;', 'x++;', 'x--;', 'x binop= expr;', 'x = x binop expr' or 'x = expr binop x', where x is an l-value expression with scalar type

I have already parallelized the other parts of the code, and even though this part takes only around 8% of the total time in the single-threaded version, it becomes more and more important as I increase the number of threads. I am relatively new to OpenMP and I am stuck on this. What is the best way to parallelize this part of the code?


Solution

  • As I've mentioned in my comments, I think it is the wrong approach to start with this sort of linked-list-like structure, since pointer chasing will negatively impact performance, especially for large particle lists. But to show this, we have to make a benchmark and a minimal reproducible example, so I modified the shown code for this.

    Original approach

    struct Boxes
    {
        int m_NX;
        int m_NY;
        double m_boxsize;
        Eigen::ArrayXi m_boxes;
    
        Boxes(int NX, int NY, double boxsize)
        : m_NX(NX),
          m_NY(NY),
          m_boxsize(boxsize),
          m_boxes(Eigen::ArrayXi::Constant(NX * NY, -1))
        {}
    };
    

    Since the original question didn't make this abundantly clear: m_boxes contains the index of one particle in the respective box. I chose to denote the end of the linked list (no particles in the box) as particle index -1.

    struct Particles
    {
        Eigen::Array2Xd m_positions;
        Eigen::ArrayXi m_nextParticles;
    
        explicit Particles(int N)
        : m_positions(Eigen::Array2Xd::Random(2, N).abs()),
          m_nextParticles(Eigen::ArrayXi::Constant(N, -1))
        {}
        int nbParticles() const noexcept
        { return static_cast<int>(m_positions.cols()); }
    };
    

    Again to explain the original design, m_nextParticles holds, for each particle, the index of the next one in the same box. And again -1 indicates the end of the list.

    Three further changes / optimizations:

    1. The member m_nbParticles was redundant as both arrays know their size and therefore the number of particles
    2. I changed the array type to Array2Xd since we deal with 2D coordinates. One column per particle (Array2Xd instead of ArrayX2d) is more efficient, at least for the processing we do in my example, because Eigen is column-major
    3. For testing, I build random particles in the range [0, 1)

    The sequential linking function then looks like this:

    void link_list(Boxes& p_boxes, Particles& p_particles)
    {
        p_boxes.m_boxes.fill(-1);
        p_particles.m_nextParticles.fill(-1);
        const double boxscale = 1. / p_boxes.m_boxsize;
        for(int i = 0, n = p_particles.m_positions.cols(); i < n; ++i) {
            Eigen::Array2i coords =
                (p_particles.m_positions.col(i) * boxscale).cast<int>();
            int indexBox = coords.y() * p_boxes.m_NX + coords.x();
            p_particles.m_nextParticles[i] =
                std::exchange(p_boxes.m_boxes[indexBox], i);
        }
    }
    

    A slightly optimized version of the original that avoids the two divisions. It is not parallelized and I don't think it realistically can be. Maybe by replacing the std::exchange with an atomic exchange, but given the cost of that operation and the constant cache-line bouncing it will produce, I doubt this is viable.
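
    For the record, this is roughly what that atomic exchange could look like, using the capture form of the atomic construct (plain omp atomic only accepts simple update statements, which is what the compile error in the question complains about). The function name is mine and I have not benchmarked this variant; I would still expect the cache-line bouncing to dominate:

    void link_list_atomic(Boxes& p_boxes, Particles& p_particles)
    {
        p_boxes.m_boxes.fill(-1);
        p_particles.m_nextParticles.fill(-1);
        const double boxscale = 1. / p_boxes.m_boxsize;
        const int n = static_cast<int>(p_particles.m_positions.cols());
    #   pragma omp parallel for
        for(int i = 0; i < n; ++i) {
            Eigen::Array2i coords =
                (p_particles.m_positions.col(i) * boxscale).cast<int>();
            int indexBox = coords.y() * p_boxes.m_NX + coords.x();
            int& head = p_boxes.m_boxes.coeffRef(indexBox);
            int prev;
            // read the old list head and install particle i as the new head
            // in one atomic step (the {v = x; x = expr;} capture form)
    #       pragma omp atomic capture
            { prev = head; head = i; }
            // slot i is written by exactly one thread, no atomic needed here
            p_particles.m_nextParticles[i] = prev;
        }
    }
    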

    For benchmarking, we need some code to actually make use of the list. I chose a simple, parallelized iteration over each box that computes the center of mass per bin. Nothing fancy, nothing realistic, but the access pattern may be close to a real use case:

    Eigen::Array2Xd traverse_list(const Boxes& p_boxes, Particles& p_particles)
    {
        const int n = p_boxes.m_boxes.size();
        Eigen::Array2Xd center(2, n);
    #   pragma omp parallel for
        for(int i = 0; i < n; ++i) {
            Eigen::Array2d sum = Eigen::Array2d::Zero();
            int pcount = 0;
            int particleIndex = p_boxes.m_boxes[i];
            while(particleIndex >= 0) {
                sum += p_particles.m_positions.col(particleIndex);
                pcount += 1;
                particleIndex = p_particles.m_nextParticles[particleIndex];
            }
            center.col(i) = sum / static_cast<double>(pcount);
        }
        return center;
    }
    

    My benchmark run looks like this:

    int main()
    {
        int n_particles = 10000000, n_boxes = 200, n_repetitions = 10, n_traversals = 10;
        Boxes boxes(n_boxes, n_boxes, 1. / n_boxes);
        Particles particles(n_particles);
        for(int i = 0; i < n_repetitions; ++i) {
            link_list(boxes, particles);
            for(int j = 0; j < n_traversals; ++j)
                traverse_list(boxes, particles);
        }
    }
    

    Runtime on my 8-core / 16-thread system is

    real    0m7,727s
    user    1m56,050s
    sys     0m0,385s
    

    I expect performance to be worse on systems without hyperthreading where pointer chasing will stall the entire core.

    Alternative approach

    Instead of constructing a linked list, we can use a vector of particles per box. That avoids the pointer chasing, makes traversal easier, and allows some more independent operations. The memory requirement stays practically the same. Only the fixed overhead per box grows.

    struct Boxes
    {
        int m_NX;
        int m_NY;
        double m_boxsize;
        std::vector<std::vector<int>> m_boxes;
    
        Boxes(int NX, int NY, double boxsize)
        : m_NX(NX),
          m_NY(NY),
          m_boxsize(boxsize),
          m_boxes(static_cast<unsigned>(NX * NY))
        {}
        static Boxes empty_like(const Boxes& p_o)
        { return Boxes(p_o.m_NX, p_o.m_NY, p_o.m_boxsize); }
    };
    struct Particles
    {
        Eigen::Array2Xd m_positions;
    
        explicit Particles(int N)
        : m_positions(Eigen::Array2Xd::Random(2, N).abs())
        {}
    };
    

    To fill the boxes, we start with a sequential algorithm for comparison:

    void link_list(Boxes& p_boxes, Particles& p_particles)
    {
        for(std::vector<int>& box: p_boxes.m_boxes)
            box.clear();
        const double boxscale = 1. / p_boxes.m_boxsize;
        for(int i = 0, n = p_particles.m_positions.cols(); i < n; ++i) {
            Eigen::Array2i coords =
                (p_particles.m_positions.col(i) * boxscale).cast<int>();
            unsigned indexBox = static_cast<unsigned>(
                coords.y() * p_boxes.m_NX + coords.x());
            p_boxes.m_boxes[indexBox].push_back(i);
        }
    }
    

    And the traversal function:

    Eigen::Array2Xd traverse_list(const Boxes& p_boxes, Particles& p_particles)
    {
        const std::size_t n = p_boxes.m_boxes.size();
        Eigen::Array2Xd center(2, static_cast<Eigen::Index>(n));
    #   pragma omp parallel for
        for(std::size_t i = 0; i < n; ++i) {
            Eigen::Array2d sum = Eigen::Array2d::Zero();
            for(int particleIndex: p_boxes.m_boxes[i])
                sum += p_particles.m_positions.col(particleIndex);
            center.col(static_cast<Eigen::Index>(i)) =
                sum / static_cast<double>(p_boxes.m_boxes[i].size());
        }
        return center;
    }
    

    We still have a chaotic memory access pattern when reading the coordinates but we got rid of the loop-carried dependency on the next particle index. Instead, this code only has a loop-carried dependency chain on the sum, which of course is part of the example, not the real application. This runs at

    real    0m5,032s
    user    1m6,940s
    sys     0m0,394s
    

    We can try parallelizing the link_list function, too. However, the best I could come up with requires some temporary memory and cross-thread communication.

    void link_list(Boxes& p_boxes, Particles& p_particles)
    {
        const double boxscale = 1. / p_boxes.m_boxsize;
        const std::size_t n_boxes = p_boxes.m_boxes.size();
        const int n_particles = static_cast<int>(p_particles.m_positions.cols());
        using box_ptr = Boxes*;
        const std::unique_ptr<box_ptr[]> threadboxes =
            std::make_unique<box_ptr[]>(omp_get_max_threads());
    #   pragma omp parallel
        {
            const int threadcount = omp_get_num_threads();
            const int threadid = omp_get_thread_num();
            Boxes own_boxes = Boxes::empty_like(p_boxes);
            threadboxes[threadid] = &own_boxes;
    #       pragma omp for
            for(int i = 0; i < n_particles; ++i) {
                Eigen::Array2i coords =
                    (p_particles.m_positions.col(i) * boxscale).cast<int>();
                unsigned indexBox = static_cast<unsigned>(
                    coords.y() * own_boxes.m_NX + coords.x());
                own_boxes.m_boxes[indexBox].push_back(i);
            }
    #       pragma omp for
            for(std::size_t i = 0; i < n_boxes; ++i) {
                // move to local to reduce cache-line bouncing
                std::vector<int> box = std::move(p_boxes.m_boxes[i]);
                box.clear();
                for(int j = 0; j < threadcount; ++j) {
                    const std::vector<int>& threadbox =
                        threadboxes[j]->m_boxes[i];
                    box.insert(box.end(), threadbox.begin(), threadbox.end());
                }
                p_boxes.m_boxes[i] = std::move(box);
            }
        }
    }
    

    The idea is pretty simple. We use two phases:

    1. We create a Boxes structure per thread that each thread can fill individually with a subset of all particles
    2. We concatenate all partial boxes into the final result

    This is very similar to how one would parallelize a histogram computation. In fact, we run into many of the same issues and can probably apply the same techniques.
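
    For comparison, the classic two-phase histogram pattern looks roughly like this (a sketch only; the function name and the assumption that all values lie in [0, 1) are mine):

    std::vector<int> histogram(const std::vector<double>& values, int nbins)
    {
        std::vector<int> hist(nbins, 0);
        const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(values.size());
    #   pragma omp parallel
        {
            // phase 1: every thread fills its own private histogram
            std::vector<int> local(nbins, 0);
    #       pragma omp for nowait
            for(std::ptrdiff_t i = 0; i < n; ++i)
                local[static_cast<int>(values[i] * nbins)] += 1;
            // phase 2: merge the partial histograms into the shared result
    #       pragma omp critical
            for(int b = 0; b < nbins; ++b)
                hist[b] += local[b];
        }
        return hist;
    }
    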

    There are four problems with this parallelized link_list:

    1. The temporary memory allocations cost performance
    2. When concatenating, we cause a lot of cross-CPU communication
    3. We invoke several implicit thread barriers (one per loop)
    4. The actual work we do to compute the box of each particle is very small. We start with an algorithm that is mostly memory-constrained and then add more memory operations on top

    All in all, this is not faster, at least on my system.

    real    0m5,852s
    user    1m26,415s
    sys     0m0,757s
    

    Final thoughts

    1. I already mentioned that this kind of fixed-grid binning may not be ideal. Other structures like R-trees are built specifically for nearest-neighbour queries and similar tasks.

    Boost's Spatial Indexes provide a handy, if not particularly well documented, implementation; a rough sketch follows below. A fixed grid can still work fine, but mostly for densely packed and evenly distributed data where you also know the search radii a priori.
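
    A rough sketch of that (the helper name, the quadratic<16> parameter, and the nearest-neighbour usage are mine, meant only to show the shape of the API, not a drop-in solution):

    // needs <boost/geometry.hpp> and <boost/geometry/index/rtree.hpp>
    namespace bg = boost::geometry;
    namespace bgi = boost::geometry::index;
    using Point = bg::model::point<double, 2, bg::cs::cartesian>;
    using Value = std::pair<Point, int>;   // position plus particle index
    
    bgi::rtree<Value, bgi::quadratic<16>>
    build_rtree(const Eigen::Array2Xd& positions)
    {
        std::vector<Value> values;
        values.reserve(static_cast<std::size_t>(positions.cols()));
        for(Eigen::Index i = 0; i < positions.cols(); ++i)
            values.emplace_back(Point(positions(0, i), positions(1, i)),
                                static_cast<int>(i));
        // the range ("packing") constructor bulk-loads the tree, which is
        // much faster than inserting the particles one by one
        return bgi::rtree<Value, bgi::quadratic<16>>(values.begin(),
                                                     values.end());
    }
    
    // usage, e.g. the 8 nearest neighbours of a query point:
    // auto tree = build_rtree(particles.m_positions);
    // std::vector<Value> hits;
    // tree.query(bgi::nearest(Point(0.5, 0.5), 8), std::back_inserter(hits));
    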

    If your particle dataset is mostly read-only, sorting it into a k-d tree can be done with zero memory overhead, using just std::nth_element, as sketched below.
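
    The core of that construction is just a recursive median partition over an index array. A rough, untuned sketch (needs <algorithm> and <vector>):

    // sorts the index range [first, last) into k-d tree order in place:
    // the middle element becomes the node, the two halves the subtrees
    void kd_sort(const Eigen::Array2Xd& positions,
                 std::vector<int>::iterator first,
                 std::vector<int>::iterator last,
                 int axis = 0)
    {
        if(last - first <= 1)
            return;
        const auto mid = first + (last - first) / 2;
        // put the median along the current axis in the middle; smaller
        // elements end up on the left, larger ones on the right
        std::nth_element(first, mid, last, [&](int a, int b)
            { return positions(axis, a) < positions(axis, b); });
        kd_sort(positions, first, mid, 1 - axis);     // left subtree
        kd_sort(positions, mid + 1, last, 1 - axis);  // right subtree
    }
    

    A range query then walks the same index array recursively, descending only into halves that can still contain points within the search radius.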

    2. Once you start modifying the particles while iterating over the boxes, performance will suffer from cache-line bouncing and possibly from running out of store buffers. A chaotic write pattern by multiple threads on the same data structure is simply not ideal. If this is something you do, consider actually sorting the particles spatially so that particles in the same box are also located next to each other in memory; a sketch follows below. This applies to both approaches of traversing the boxes.
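
    A sketch of what such a reordering could look like for the fixed grid, written against the structs of the alternative approach (the function name and the counting-sort strategy are mine; after reordering, the boxes have to be re-filled):

    // reorders the positions so that particles of the same box are
    // contiguous in memory (counting sort over the box index)
    Particles sort_by_box(const Boxes& p_boxes, const Particles& p_particles)
    {
        const double boxscale = 1. / p_boxes.m_boxsize;
        const Eigen::Index n = p_particles.m_positions.cols();
        const int n_boxes = p_boxes.m_NX * p_boxes.m_NY;
        std::vector<int> offsets(n_boxes + 1, 0);
        Eigen::ArrayXi box_of(n);
        // pass 1: compute each particle's box and count particles per box
        for(Eigen::Index i = 0; i < n; ++i) {
            Eigen::Array2i coords =
                (p_particles.m_positions.col(i) * boxscale).cast<int>();
            box_of[i] = coords.y() * p_boxes.m_NX + coords.x();
            ++offsets[box_of[i] + 1];
        }
        // exclusive prefix sum: start offset of every box in the new order
        for(int b = 0; b < n_boxes; ++b)
            offsets[b + 1] += offsets[b];
        // pass 2: scatter every position into its box-contiguous slot
        Particles sorted = p_particles;
        for(Eigen::Index i = 0; i < n; ++i)
            sorted.m_positions.col(offsets[box_of[i]]++) =
                p_particles.m_positions.col(i);
        return sorted;
    }
    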