Tags: python, numpy, nested-loops

Reducing execution time in python for nested loops


I have written the following simple code for a numerical simulation. I am a beginner programmer.

import numpy as np
import time as t
start = t.time()
r=10**-8
alpha=60*(np.pi/180)
gamma_sa=58.6*10**-3 
gamma_sw=25*10**-3
gamma_pa=153*10**-3
gamma_pw=110*10**-3
gamma_aw=72.5*10**-3    
kt= 1.38*10**-23*293 
i=0
##############variables########################
omega=0 
zeta= (3/2 )*np.pi*r**3 *10**-3
dt=0.01 
std=np.sqrt(2*kt*zeta*dt)
for k in range(1,2):
    beta_i=[]
    j_i=[]
    B=[]
    time=np.arange(dt,10,dt)
    Fs_i=[]
    dE_i=[]
    j=0
    for i in range (len(time)):
        j_i.append(j)
        beta=(90-j)
        beta1=(90-j)*(np.pi/180)
        Fs=0
        Ft = (np.random.randn()*std*np.sqrt(dt))/zeta
        beta_i.append(beta)
        del(beta)
        j=(j+Ft+Fs)%360
    MSD=[]
    diff_i=[]
    tau_i=[]
    for l in range(1,len(time)):
        tau=l*dt
        tau_i.append(tau)
        del(tau)
        for i in range(1,(len(time)-l)):
            diff=(j_i[l+i]-j_i[i])**2*dt
            diff_i.append(diff)
        MSD_j=np.sum(diff_i)/np.max(time)
        MSD.append(MSD_j)
        del(MSD_j) 
    np.savetxt("MSD_no_fs%d"%k, MSD)
    np.savetxt("Tau_no_fs%d"%k,tau_i)
print(t.time() - start)

The code runs successfully and the execution time is ~38 s. But if I decrease dt from 0.01 to 0.001, it seems to take unlimited time: the script keeps running without an error. Can someone explain how the execution time depends on dt, on the range of k, and on the range of time, along with an efficient way to do this? I would like to go to dt=0.0001, k range (0,100,dt) and time (dt,100,dt). What would be the best practice for this?

PS: 8 GB RAM and a 3.31 GHz v6 processor.


Solution

  • The length of time is inversely proportional to the magnitude of dt, and you have a nested loop over the length of time. As long as that loop is your hottest code, the runtime grows quadratically (O(n**2)). Stepping from 0.01 to 10 by 0.01 gives (close enough) ~1,000 elements and ~500,000 units of inner-loop work. Doing the same thing from 0.001 to 10 by 0.001 gives ~10,000 elements and ~50,000,000 units of work, an increase of 100x.
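    As a quick sanity check on those numbers (a small sketch, not part of the original code; work here just models how many pairs the nested loops visit):

    ```python
    import numpy as np

    # Model how the step count and the nested loops' workload scale with dt.
    for dt in (0.01, 0.001):
        n = len(np.arange(dt, 10, dt))   # number of timesteps, roughly 10 / dt
        work = n * (n - 1) // 2          # pairs (l, i) visited by the nested loops
        print(f"dt={dt}: {n} steps, ~{work} inner iterations")
    ```

    Each factor-of-10 reduction in dt multiplies the element count by ~10 and the inner-loop work by ~100.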

    From a starting point of 38 seconds, that's pretty extreme; even if memory weren't an issue, you'd be looking at ~3,800 seconds (over an hour). And memory is an issue: diff_i is never cleared between outer iterations, so the inner loop keeps appending to it and it ends up holding roughly n**2/2 floats. With dt of 0.01 that's ~500,000 floats (which on 64-bit CPython builds occupy 24 bytes apiece, plus 8 bytes for the reference in the list), so about 16 MB of RAM. With dt of 0.001, that ends up closer to 1.6 GB, and going to 0.0001 would bring you to ~160 GB, far beyond your 8 GB. Worse still, MSD_j = np.sum(diff_i) rescans that entire growing list on every outer iteration, so the total time is actually closer to cubic than quadratic; and once the list outgrows the CPU caches, memory traffic slows you even more than the straight CPU costs would indicate.
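    Those per-object figures can be checked directly (assuming a 64-bit CPython build, as above):

    ```python
    import sys

    x = 3.14
    print(sys.getsizeof(x))        # bytes per float object: 24 on 64-bit CPython

    lst = [0.0] * 1000
    refs = sys.getsizeof(lst) - sys.getsizeof([])
    print(refs / 1000)             # bytes per list slot holding the reference: 8
    ```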

    Your code takes very little advantage of numpy's features, and could likely be tightened up quite a bit. Doing so would save a great deal of memory, and allow most of the work to be pushed to single function calls with the loops performed in C, running much faster than the Python interpreter.

    For the simplest improvement, take diff_i. Before the loop even runs, you know exactly how many elements it will have, and the calculation can be reduced to whole-array operations: convert j_i to a numpy array before the outer loop begins, then replace the inner loop with a single slicing expression that produces the new differences as a numpy array, with no Python-level loops at all.

    I haven't tested this (no numpy on this machine), but a rough stab at this would be to convert j_i immediately after it's been populated with:

    j_i = np.array(j_i)
    

    and declare diff_i as:

    diff_i = np.array([])
    

    then replace:

    for i in range(1,(len(time)-l)):
        diff=(j_i[l+i]-j_i[i])**2*dt
        diff_i.append(diff)
    

    with:

    new_diff = (j_i[l + 1:] - j_i[1:-l]) ** 2 * dt
    diff_i = np.concatenate((diff_i, new_diff))
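
    One caveat with that approach: np.concatenate in a loop copies the whole array each time, so it is still quadratic in time and memory. Since diff_i is only ever consumed by np.sum, a cheaper variant (a sketch, untested against the real simulation; the random-walk j_i below is a stand-in for the simulated angles) keeps a running scalar total instead of the growing array:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    dt = 0.01
    j_i = rng.standard_normal(999).cumsum()   # stand-in for the simulated trajectory

    T = 10  # ~ np.max(time) in the original code
    running = 0.0
    MSD = []
    for l in range(1, len(j_i)):
        d = j_i[l + 1:] - j_i[1:-l]           # all (j_i[l+i] - j_i[i]) pairs at once
        running += np.sum(d * d) * dt         # same total np.sum(diff_i) would give
        MSD.append(running / T)
    ```

    This reproduces the original accumulate-then-sum behaviour in O(1) extra memory per lag, so only the trajectory and MSD arrays need to fit in RAM.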