I am making a particle simulation where each particle has an attraction to every other particle, scaling with distance. On top of this I have added some other features, such as eating food and spending energy for the particles to move. I am currently simulating 100 particles, but the simulation runs much slower than I would like, at around 5 frames per second. I have optimised most of the calculations (for example, I no longer use trig, just square roots and basic operations), and I am using JIT compilation and vectorisation to speed up the maths and operate on all particles simultaneously. I am working in Python, but the JIT should compile the intensive maths down to fast native code.
To speed up the code I have made it so that the calculations are only done when particles are within a reasonable distance of each other. I have also heard of splitting the simulation into chunks and simulating one chunk at a time, but this seems similar to what I am doing now, and would involve a massive rewrite for mediocre performance gains compared to what I have.
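For reference, the chunking approach I have seen described (usually called a uniform grid or cell list; the choice of one interaction range as the `cell_size` is my assumption) looks roughly like this:

```python
from collections import defaultdict

def build_grid(positions, cell_size):
    """Bucket particle indices by integer cell coordinates."""
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(positions):
        grid[(int(x // cell_size), int(y // cell_size))].append(idx)
    return grid

def neighbours(grid, positions, idx, cell_size):
    """Candidate partners in the particle's own cell and the 8 adjacent ones.

    Covers every particle within cell_size of particle idx.
    """
    x, y = positions[idx]
    cx, cy = int(x // cell_size), int(y // cell_size)
    out = []
    for ox in (-1, 0, 1):
        for oy in (-1, 0, 1):
            out.extend(j for j in grid.get((cx + ox, cy + oy), ()) if j != idx)
    return out
```

Each particle is then only compared against particles in its own and adjacent cells, which cuts the pair loop from O(N²) towards O(N) when the particles are spread out.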
Does anyone know of a way I can improve the runtime? I would like to get it to 60 fps, so I am quite far off, but I think I have seen similar simulations that run much faster.
I have a lot of poorly written code, and I don't think that you can gain much from it, but here are the relevant bits:
import math
import numpy as np
from numba import guvectorize, jit, float64, int32

@guvectorize([(float64[:], float64[:], int32, float64[:], float64[:])], '(m),(n),(),(p)->(n)', nopython=True)
def mathStuff(ipos, j, interact_range, matrix, vels2):
    vels2[2] = 0
    vels2[3] = 0
    if ipos[0] != j[0] and ipos[1] != j[1]:
        delta1 = ipos[0] - j[0]
        delta2 = ipos[1] - j[1]
        dist = math.sqrt(delta1**2 + delta2**2)
        if in_range(dist, interact_range):  # if j is in range
            dx, sector = distance_calc(dist, interact_range)  # get the sector the particles are in
            if sector <= 4:  # close by: use the remainder of the sector and repel via an inverse square law
                # dy/dx = 3
                dx = (dx + sector + 1) / 5
                vel_mag = -(1 / dx)**2
            elif 4 < sector <= 16:  # medium distance: scale from 0 up to the attraction matrix value
                dx = dx + sector - 5
                grad = matrix[int(j[4])] / 12
                vel_mag = grad * dx
            elif 16 < sector <= 26:  # large distance: scale from the attraction matrix value down to 0
                dx = dx + sector - 16
                grad = matrix[int(j[4])]
                vel_mag = grad - (grad / 10) * dx
            else:  # too far - 0 velocity
                vel_mag = 0
            vels = trig(delta2, delta1, vel_mag)  # get what the velocity should be
            vels2[0] = vels[0]
            vels2[1] = vels[1]
        else:
            vels2[0] = 0.0
            vels2[1] = 0.0
        if in_range(dist, interact_range / 4):
            vels2[2] = int(j[4]) + 1
    else:
        vels2[0] = 0.0
        vels2[1] = 0.0

for c1, i in enumerate(particles):  # for each particle
    i.timestep(positions[c1][0], positions[c1][1])  # update position
    positions[c1][2] = 0
    positions[c1][3] = 0
    matrixes = [i.attraction_matrix[key] for key in i.attraction_matrix]
    expand_pos = np.concatenate((positions, np.asarray([types]).T), axis=1)
    v = np.asarray(mathStuff([y for y in positions[c1]], expand_pos, int(interact_range), matrixes))
@jit(nopython=True)
def in_range(dist, interact_range):
    if dist < math.sqrt(2 * interact_range**2):
        return True
    else:
        return False

# calculate the sector that a particle is in (based on distance split into 32 sectors)
@jit(nopython=True)
def distance_calc(other_dist, interact_range):
    other_dist = other_dist / (math.sqrt(2 * interact_range**2) / 32)
    dx = other_dist % 1
    sector = other_dist - dx
    return dx, sector

# get velx and vely from velTotal and posx and posy
@jit(nopython=True)
def trig(a, b, vel_mag):
    angle = a / b
    sign = np.sign(b)
    sqrt = math.sqrt(1 + angle**2)
    self_vel0 = (-vel_mag / sqrt) * sign
    self_vel1 = self_vel0 * angle
    return self_vel0, self_vel1
Here are a few improvements you can make:

- `if dist < math.sqrt(2*interact_range**2):` can be replaced with `if dist < math.sqrt(2) * interact_range:`. This will be significantly faster because square roots are expensive and `math.sqrt(2)` can be computed at compile time by Numba. In fact, the whole `in_range` function can be reduced to `return dist < math.sqrt(2) * interact_range`.
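The identity behind this rewrite is just `sqrt(2*r**2) == sqrt(2)*r` for non-negative `r`; a minimal check (plain Python here, no Numba needed to see the equivalence):

```python
import math

def in_range_old(dist, interact_range):
    # original: computes a square root on every call
    return dist < math.sqrt(2 * interact_range**2)

def in_range_new(dist, interact_range):
    # rewritten: math.sqrt(2) is a constant Numba can fold at compile time
    return dist < math.sqrt(2) * interact_range

# both versions agree across a spread of distances
for d in (0.0, 1.0, 7.07, 7.08, 50.0):
    assert in_range_old(d, 5.0) == in_range_new(d, 5.0)
```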
- `other_dist/(math.sqrt(2*interact_range**2)/32)` can be replaced by `other_dist * (32/math.sqrt(2)) / interact_range` for the same reasons.
- Use `fastmath=True` if you are sure no special values (like NaN, Inf, or subnormals) are produced by your code (even temporarily). This can decrease the accuracy of the output (by breaking the floating-point associativity rules and, more generally, the IEEE-754 ones).
- Without `fastmath`, it is faster to use `a*(1/12)` than `a/12`, since multiplications are faster than divisions. Also, `-(1/dx)**2` can be replaced by `-1/dx**2`.
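Applied to `distance_calc`, those rewrites look like this (shown as plain Python so it runs anywhere; in the real code the `@jit(nopython=True)` decorator stays on):

```python
import math

SCALE = 32.0 / math.sqrt(2.0)  # folded constant: 32 / sqrt(2)

def distance_calc_fast(other_dist, interact_range):
    # other_dist / (sqrt(2*r**2) / 32)  ==  other_dist * (32/sqrt(2)) / r
    scaled = other_dist * SCALE / interact_range
    dx = scaled % 1
    sector = scaled - dx
    return dx, sector
```

This costs one multiplication and one division per call instead of a square root plus two divisions.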
- `float64[:]` can be replaced by `float64[::1]` if the input array is contiguous. This helps Numba vectorize loops better when it is possible and simple (not much here).
- `matrix[int(j[4])]` can be replaced by `matrix[np.uint32(j[4])]` if you are sure the index value is positive. This helps Numba avoid adding instructions to support wraparound.
- `int(j[4])+1` can be replaced by `np.ceil(j[4])` since `vels2[2]` holds a floating-point value (one needs to be careful if the input is negative, though, so it is better to check that the output matches your needs in that case).
- The `guvectorize` code can be parallelized (possibly with `parallel=True` or `target='parallel'`). However, it is certainly better to use Numba in the caller code (to avoid the overhead of calling a Numba function from Python, which may be expensive in your case), and thus to parallelize the outer loop if possible. The latter is only possible if there are no dependencies between iterations (it is not clear to me whether this is the case in the code); parallelizing a loop with dependencies causes a race condition.
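A sketch of what a parallelized outer loop could look like — a hypothetical, simplified pairwise kernel (toy inverse-square attraction and a made-up cutoff, not your exact formulas), with a plain-Python fallback so the sketch runs even without Numba installed:

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:              # plain-Python fallback so the sketch still runs
    def njit(**kwargs):
        return lambda f: f
    prange = range

@njit(parallel=True)
def all_velocities(pos, cutoff2, out):
    n = pos.shape[0]
    for i in prange(n):          # safe to parallelize: iteration i writes only out[i]
        vx = 0.0
        vy = 0.0
        for j in range(n):
            dx = pos[j, 0] - pos[i, 0]
            dy = pos[j, 1] - pos[i, 1]
            d2 = dx * dx + dy * dy
            if 0.0 < d2 < cutoff2:   # skip self (d2 == 0) and far pairs
                inv = 1.0 / d2       # toy inverse-square attraction
                vx += dx * inv
                vy += dy * inv
        out[i, 0] = vx
        out[i, 1] = vy

pos = np.array([[0.0, 0.0], [1.0, 0.0]])
vel = np.empty_like(pos)
all_velocities(pos, 100.0, vel)  # two particles pull on each other symmetrically
```

Because each outer iteration only reads shared data and writes its own row of `out`, there is no cross-iteration dependency here, which is exactly the condition that makes `prange` safe.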