Emulating a linked list using CUDA for cell list molecular dynamics

So I have a molecular dynamics (MD) code that uses CUDA to parallelize everything. This results in O(N^2) performance due to the pair-wise nature of the force calculations. We can bring this down to O(N) if the force is short range by dividing simulation space in cells of side r and only calculating the pairwise interaction from particles in neighboring cells. The usual implementation of this is using linked lists that cycle through the particle indices in the cell.

The implementation I was trying was emulating the linked list using two sets of arrays. One is the header and the other contains the indices that are linked. Roughly speaking header[i] will contain index j meaning particle j belongs to cell i, then lsc[j] will point to another particle j' that also belongs to cell i and so on. My attempt at implementing this idea was the following:

// Update the Cell_list
__host__
void update_cell(vec2 *pos, int *head, int *lsc, int n_part){
    int index = blockDim.x * blockIdx.x + threadIdx.x;

    int c; // Cell that atom[i] belongs to to be calculated with pos[i]

    lsc[index] = head[c];
    head[c] = index;
}

But I run into the problem that threads run in parallel (duh) so my very serial implementation does't work right. If only there was a method to assign both values at the same time I wouldn't have a problem. Do I need to rethink my implementation of Linked Lists from the roots?

Edit: I just remembered I have a very related question involving an alternative way of implementing this but I don't know if it's going to be any useful (I'm new to C and also CUDA). I was thinking of creating a struct like:

struct cell{
    int *indices;
};

That for each cell stores an array of indices. Then knowing that in each cell I can only have a maximum number of particles, maybe I can allocate memory to it and access it as if it was an array with 2 indices. Basically I would need a way to create 2D arrays and manipulate them in a kernel.

Solution

Simulating particles with short range interactions efficiently using a GPU is a challenging problem. You have alluded to the core strategy of partitioning the particles using a fixed grid to make it easier to find the interacting pairs. Efficiently parallelizing this step is the crux of the problem and specialized algorithms and data structures have been developed specifically for CUDA implementations.

Particle physics, especially with respect to video game physics, has produced several techniques for parallelizing the analogous problem of collision detection. Unsurprisingly, NVidia has devoted significant resources to optimizing this task (see GPU Gems 3). This maps directly to problem of finding the pairs of interacting particles.

The basic strategy is to keep a spatially sorted array of the particles (e.g. using Morton codes) such that particles that are spatially close together are also close together in the array (see Bounding Volume Hierarchy and especially Axis Aligned Bounding Box). Finding the pairs is a matter of traversing the list using a window wide enough to capture interacting particles. Instead of being O(N^2) it is O(N*k) where k is the window width and k << N. The linked Nvidia resources give a detailed CUDA implementation of this task.