I am having trouble with OpenMP in C

I want to parallelize the for loops below, but I can't seem to grasp the concept. Every time I try to parallelize them, the code still works but slows down dramatically.

for(i=0; i<nbodies; ++i){
    for(j=i+1; j<nbodies; ++j) {
        d2 = 0.0;

        /* squared distance between bodies i and j */
        for(k=0; k<3; ++k) {
            rij[k] = pos[i][k] - pos[j][k];
            d2 += rij[k]*rij[k];
        }

        /* only pairs with d2 within cut2 contribute */
        if (d2 <= cut2) {
            d = sqrt(d2);
            d3 = d*d2;

            /* equal and opposite contribution on i and j */
            for(k=0; k<3; ++k) {
                double f = -rij[k]/d3;
                forces[i][k] += f;
                forces[j][k] -= f;
            }

            ene += -1.0/d;
        }
    }
}

I tried synchronizing with barrier and critical in some places, but either nothing happens or the program simply never finishes.

Update: this is where I am right now. It runs without crashing, but the calculation time gets worse the more threads I add (Ryzen 5 2600, 6 cores / 12 threads).

#pragma omp parallel shared(d,d2,d3,nbodies,rij,pos,cut2,forces) private(i,j,k) num_threads(n)
{
    clock_t begin = clock();
    #pragma omp for schedule(auto)
    for(i=0; i<nbodies; ++i){
        for(j=i+1; j<nbodies; ++j) {
            d2 = 0.0;
            for(k=0; k<3; ++k) {
                rij[k] = pos[i][k] - pos[j][k];
                d2 += rij[k]*rij[k];
            }

            if (d2 <= cut2) {
                d = sqrt(d2);
                d3 = d*d2;
                #pragma omp parallel for shared(d3) private(k) schedule(auto) num_threads(n)
                for(k=0; k<3; ++k) {
                    double f = -rij[k]/d3;
                    #pragma omp atomic
                    forces[i][k] += f;
                    #pragma omp atomic
                    forces[j][k] -= f;
                }

                ene += -1.0/d;
            }
        }
    }

    clock_t end = clock();
    double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    #pragma omp single
    printf("Calculation time %lf sec\n",time_spent);
}

I moved the timer inside the parallel region (I think it is a few milliseconds faster this way). I also think I got most of the shared and private variables right. The forces are written to the output file.
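A note on the timing itself: clock() reports the processor time used by the whole program, and on many platforms (glibc on Linux, for example) that figure is summed over all threads, so it can grow with the thread count even when the elapsed time shrinks. Below is a minimal sketch of wall-clock timing with omp_get_wtime() instead. The names wall_seconds and timed_sum are hypothetical and used only for illustration, and the loop is a stand-in workload, not the force calculation.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>   /* omp_get_wtime(); only present when built with OpenMP */
#endif

/* Hypothetical helper: seconds read from a wall clock. */
static double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    /* Serial fallback: with one thread, CPU time approximates elapsed time. */
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Stand-in workload: sums 0..n-1 in parallel and reports the elapsed time. */
double timed_sum(int n, double *elapsed)
{
    assert(n >= 0 && elapsed != NULL);
    double sum = 0.0;
    int i;

    double begin = wall_seconds();       /* read once, before the parallel loop */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; ++i)
        sum += (double)i;
    *elapsed = wall_seconds() - begin;   /* read once, after all threads are done */

    return sum;
}
```

Taking both readings outside the parallel region gives a single interval for the whole loop, rather than one interval per thread.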

Solution

Solved. It turns out all I needed was:

#pragma omp parallel for nowait

It doesn't need the "atomic" directives either.

It's a weird solution and I don't fully understand how it works, but it does, and the output file has no corrupt results whatsoever.