gpu opencl game-physics gpgpu physics-engine

How to parallelize collision constraints in position based dynamics

I'm trying to implement PBD algorithm in OpenCL. In the original paper the algorithm is stated as

(1) forall vertices i
(2)     initialize x[i]=x0[i], v[i]=v0[i], w[i]=1/m[i]
(3)endfor
(4)loop
(5)    forall vertices i do v[i]←v[i]+∆t*w*i*f_ext(x[i])
(6)    dampVelocities(v1,...,vN)
(7)    forall vertices i do p[i]←x[i]+∆t*v[i]
(8)    forall vertices i do generateCollisionConstraints(xi→pi)
(9)    loop solverIterations times
(10)        projectConstraints(C1,...,CM+Mcoll,p1,...,pN)
(11)   endloop
(12)   forall vertices i
(13)       v[i]←(p[i]−x[i])/∆t(14)x[i]←p[i]
(15)   endfor
(16)   velocityUpdate(v1,...,vN)
(17)endloop

The step 8 is the most difficulut to efficiently parallelize. What I'm doing currently is to use a regular 3D grid (represented as a single buffer) where I first insert all the particles. Next I iterate neighboring cells and check for collisions. But what do I do when the collision is actually detected? I cannot hold a variable size vector, nor I can "append" constrains to a buffer because that would require atomic operations, which are not parallel. What are some efficient approaches for implementing PBD on GPU?

Solution

What I figured out so far is that probably the best approach to implement this is either

use atomicAdd to simulate stack, similarly to the way it's done in face culling https://vkguide.dev/docs/gpudriven/compute_culling/
preallocate a small stack per each thread and let each thread append the constraints to that private stack. Then you could efficiently join all those stacks into one large array in logarithmic time. To show it more visually consider a 3 stacks each of maximum length 2

Stack 1: |_|_|  (empty stack)
Stack 2: |A|_|  (stack with one element)
Stack 3: |B|C|  (stack fully filled up to its maximum length)

Then concatenate these stacks like so

|A|_|_|_|B|C|

and now iteratively count how much each element should be shifted, which is equal to the number of vacant elements to its left (inclusive). Create a second buffer that holds the number of those vacant elements. The initial state of this buffer is

|0|1|1|1|0|0|  (each cell is 0 if it is empty or 1 otherwise)

In the second iteration you sum values of cells at positions i and i+1 and store the result in i+1.

|0|1|2|2|1|0|  (each cell holds number of filled cells up to 2 places to the left, including itself)

In the third iteration you sum up values of cells i and i+2

|0|1|2|3|3|2|  (each cell holds number of filled cells up to 4 places to the left, including itself)

In the fourth iteration you sum up values of cells i and i+4

|0|1|2|3|3|3|  (each cell holds number of filled cells up to 8 places to the left, including itself)

Now you know that A should be shifted by 0 to the left, B by 3 and C by 3, arriving at final result

|A|B|C|_|_|_|