I am doing simulations of N particles that interact with each other within a given range. To avoid an N^2 computation over the particles, I sort the particles spatially into boxes: an array stores, for each box, the index of one particle, and each particle then points to another particle in the same box. I have already written a sequential code in C++ and I am trying to implement an OpenMP version to increase the number of particles. To define the particles and the boxes I use two classes:
class Boxes
{
int m_NX;
int m_NY;
int m_Nboxes;
Eigen::ArrayXi m_boxes;
...
};
class Particles
{
int m_nbParticles;
Eigen::ArrayXd m_positions;
Eigen::ArrayXi m_nextParticles;
...
};
Then, to sort the particles, I do this:
void updateBoxes(Boxes &p_boxes, Particles &p_particles)
{
...
for (int i = 0; i < p_particles.m_nbParticles; i++)
{
indexX = p_particles.position(i).x() / dX;
indexY = p_particles.position(i).y() / dX;
indexBox = indexX + NXboxes*indexY;
p_particles.m_nextParticles[i] = p_boxes.m_boxes[indexBox];
p_boxes.m_boxes[indexBox] = i;
}
}
I tried to parallelize this part by adding #pragma omp atomic:
#pragma omp parallel for
for (int i = 0; i < p_particles.size(); i++)
{
indexX = p_particles.position(i).x() / dX;
indexY = p_particles.position(i).y() / dX;
indexBox = indexX + NXboxes*indexY;
#pragma omp atomic
p_particles.m_nextParticles[i] = p_boxes.m_boxes[indexBox];
#pragma omp atomic
p_boxes.m_boxes[indexBox] = i;
}
But it doesn't work, and I get an error at compile time:
error: the statement for 'atomic' must be an expression statement of form '++x;', '--x;', 'x++;', 'x--;', 'x binop= expr;', 'x = x binop expr' or 'x = expr binop x', where x is an l-value expression with scalar type
I have already parallelized the other parts of the code, and even though this part takes only around 8% of the total time in a single-threaded run, it becomes more and more important as I increase the number of threads. I am relatively new to OpenMP and I am stuck on this. What is the best way to parallelize this part of the code?
As I've mentioned in my comments, I think it is the wrong approach to start with this sort of linked-list-like structure, since pointer chasing will negatively impact performance, especially for large particle lists. But to show this, we have to build a benchmark and a minimal reproducible example, so I modified the code shown in the question accordingly.
struct Boxes
{
int m_NX;
int m_NY;
double m_boxsize;
Eigen::ArrayXi m_boxes;
Boxes(int NX, int NY, double boxsize)
: m_NX(NX),
m_NY(NY),
m_boxsize(boxsize),
m_boxes(Eigen::ArrayXi::Constant(NX * NY, -1))
{}
};
Since the original question didn't make this abundantly clear: m_boxes contains, for each box, the index of one particle in that box. I chose to denote the end of the linked list (no particles in the box) as particle index -1.
struct Particles
{
Eigen::Array2Xd m_positions;
Eigen::ArrayXi m_nextParticles;
explicit Particles(int N)
: m_positions(Eigen::Array2Xd::Random(2, N).abs()),
m_nextParticles(Eigen::ArrayXi::Constant(N, -1))
{}
int nbParticles() const noexcept
{ return static_cast<int>(m_positions.cols()); }
};
Again, to explain the original design: m_nextParticles holds, for each particle, the index of the next one in the same box. And again, -1 indicates the end of the list.
Two further changes / optimizations:
- m_nbParticles was redundant, as both arrays know their size and therefore the number of particles.
- m_positions is now an Array2Xd since we deal with 2D coordinates. One column per particle (Array2Xd instead of ArrayX2d) is more efficient, at least for the processing we do in my example, because Eigen is column-major.
void link_list(Boxes& p_boxes, Particles& p_particles)
{
p_boxes.m_boxes.fill(-1);
p_particles.m_nextParticles.fill(-1);
const double boxscale = 1. / p_boxes.m_boxsize;
for(int i = 0, n = p_particles.m_positions.cols(); i < n; ++i) {
Eigen::Array2i coords =
(p_particles.m_positions.col(i) * boxscale).cast<int>();
int indexBox = coords.y() * p_boxes.m_NX + coords.x();
p_particles.m_nextParticles[i] =
std::exchange(p_boxes.m_boxes[indexBox], i);
}
}
This is a slightly optimized version of the original that avoids the two divisions by precomputing the reciprocal of the box size. It is not parallelized, and I don't think that can realistically be done, except maybe by replacing the std::exchange with an atomic exchange. But given the cost of that operation and the constant cache-line bouncing it would produce, I doubt this is viable.
For benchmarking, we need some code that actually makes use of the list. I chose a simple, parallelized iteration over each box that computes the center of mass per box. Nothing fancy, nothing realistic, but the access pattern may be close to a real use case:
Eigen::Array2Xd traverse_list(const Boxes& p_boxes, Particles& p_particles)
{
const int n = p_boxes.m_boxes.size();
Eigen::Array2Xd center(2, n);
# pragma omp parallel for
for(int i = 0; i < n; ++i) {
Eigen::Array2d sum = Eigen::Array2d::Zero();
int pcount = 0;
int particleIndex = p_boxes.m_boxes[i];
while(particleIndex >= 0) {
sum += p_particles.m_positions.col(particleIndex);
pcount += 1;
particleIndex = p_particles.m_nextParticles[particleIndex];
}
center.col(i) = sum / static_cast<double>(pcount);
}
return center;
}
My benchmark run looks like this:
int main()
{
int n_particles = 10000000, n_boxes = 200, n_repetitions = 10, n_traversals = 10;
Boxes boxes(n_boxes, n_boxes, 1. / n_boxes);
Particles particles(n_particles);
for(int i = 0; i < n_repetitions; ++i) {
link_list(boxes, particles);
for(int j = 0; j < n_traversals; ++j)
traverse_list(boxes, particles);
}
}
Runtime on my 8-core, 16-thread system is:
real 0m7,727s
user 1m56,050s
sys 0m0,385s
I expect performance to be worse on systems without hyperthreading where pointer chasing will stall the entire core.
Instead of constructing a linked list, we can use a vector of particles per box. That avoids the pointer chasing, makes traversal easier, and allows some more independent operations. The memory requirement stays practically the same. Only the fixed overhead per box grows.
struct Boxes
{
int m_NX;
int m_NY;
double m_boxsize;
std::vector<std::vector<int>> m_boxes;
Boxes(int NX, int NY, double boxsize)
: m_NX(NX),
m_NY(NY),
m_boxsize(boxsize),
m_boxes(static_cast<unsigned>(NX * NY))
{}
static Boxes empty_like(const Boxes& p_o)
{ return Boxes(p_o.m_NX, p_o.m_NY, p_o.m_boxsize); }
};
struct Particles
{
Eigen::Array2Xd m_positions;
explicit Particles(int N)
: m_positions(Eigen::Array2Xd::Random(2, N).abs())
{}
};
To fill the boxes, we start with a sequential algorithm for comparison:
void link_list(Boxes& p_boxes, Particles& p_particles)
{
for(std::vector<int>& box: p_boxes.m_boxes)
box.clear();
const double boxscale = 1. / p_boxes.m_boxsize;
for(int i = 0, n = p_particles.m_positions.cols(); i < n; ++i) {
Eigen::Array2i coords =
(p_particles.m_positions.col(i) * boxscale).cast<int>();
unsigned indexBox = static_cast<unsigned>(
coords.y() * p_boxes.m_NX + coords.x());
p_boxes.m_boxes[indexBox].push_back(i);
}
}
And the traversal function:
Eigen::Array2Xd traverse_list(const Boxes& p_boxes, Particles& p_particles)
{
const std::size_t n = p_boxes.m_boxes.size();
Eigen::Array2Xd center(2, static_cast<Eigen::Index>(n));
# pragma omp parallel for
for(std::size_t i = 0; i < n; ++i) {
Eigen::Array2d sum = Eigen::Array2d::Zero();
for(int particleIndex: p_boxes.m_boxes[i])
sum += p_particles.m_positions.col(particleIndex);
center.col(static_cast<Eigen::Index>(i)) =
sum / static_cast<double>(p_boxes.m_boxes[i].size());
}
return center;
}
We still have a chaotic memory access pattern when reading the coordinates, but we got rid of the loop-carried dependency on the next-particle index. Instead, this code only has a loop-carried dependency chain on the sum, which of course is part of the example, not the real application. This runs at:
real 0m5,032s
user 1m6,940s
sys 0m0,394s
We can try parallelizing the link_list function, too. However, the best I could come up with requires some temporary memory and cross-thread communication.
void link_list(Boxes& p_boxes, Particles& p_particles)
{
const double boxscale = 1. / p_boxes.m_boxsize;
const std::size_t n_boxes = p_boxes.m_boxes.size();
const int n_particles = static_cast<int>(p_particles.m_positions.cols());
using box_ptr = Boxes*;
const std::unique_ptr<box_ptr[]> threadboxes =
std::make_unique<box_ptr[]>(omp_get_max_threads());
# pragma omp parallel
{
const int threadcount = omp_get_num_threads();
const int threadid = omp_get_thread_num();
Boxes own_boxes = Boxes::empty_like(p_boxes);
threadboxes[threadid] = &own_boxes;
# pragma omp for
for(int i = 0; i < n_particles; ++i) {
Eigen::Array2i coords =
(p_particles.m_positions.col(i) * boxscale).cast<int>();
unsigned indexBox = static_cast<unsigned>(
coords.y() * own_boxes.m_NX + coords.x());
own_boxes.m_boxes[indexBox].push_back(i);
}
# pragma omp for
for(std::size_t i = 0; i < n_boxes; ++i) {
// move to local to reduce cache-line bouncing
std::vector<int> box = std::move(p_boxes.m_boxes[i]);
box.clear();
for(int j = 0; j < threadcount; ++j) {
const std::vector<int>& threadbox =
threadboxes[j]->m_boxes[i];
box.insert(box.end(), threadbox.begin(), threadbox.end());
}
p_boxes.m_boxes[i] = std::move(box);
}
}
}
The idea is pretty simple. We use two phases:
1. Allocate one Boxes structure per thread that each thread can fill individually with a subset of all particles.
2. Merge the per-thread boxes back into the shared Boxes structure.
This is very similar to how one would parallelize a histogram computation. In fact, we run into many of the same issues and can probably apply the same techniques, as the sketch below shows.
This approach has its problems, though: it needs one temporary Boxes structure, including all its allocations, per thread, and the merge phase adds cross-thread communication with the associated cache-line transfers. All in all, this is not faster, at least on my system.
real 0m5,852s
user 1m26,415s
sys 0m0,757s
Some closing remarks on alternative data structures:
- Boost's Spatial Indexes have a handy, if not well documented, implementation.
- A fixed grid can still work fine, but mostly for densely packed and equally distributed data where you also know the search radii a priori.
- If your particle dataset is mostly read-only, sorting it into a k-d tree can be done with zero memory overhead and just std::nth_element, as sketched below.