I have 4 nested loops in my Python code and it takes ages to finish all the loops when the grids are big. For example, here is a piece of my code:
```python
import numpy as np
from itertools import product

T = 50
beta = 0.98
alpha = 0.3
delta = 0.1
xss = ((1 / beta - (1 - delta)) / alpha)**(1 / (alpha - 1))
yss = xss**alpha
x = np.linspace(0.8 * xss, 1.4 * xss, T, dtype='float32')
y = np.linspace(yss, 1.2 * yss, T, dtype='float32')
# z is allocated elsewhere in my code
for i, j, a, b in product(range(T), range(T), range(T), range(T)):
    z[i*T+a, j*T+b] = x[i]**alpha + (1 - delta) * x[i] + y[j] - x[a] - y[b]
```
Is there any way to make Python build the matrix `z` faster? Thanks in advance.
Looping in Python is slow. As much as possible, try to keep the work inside `numpy`. A good start is to notice that you are creating a new matrix from four parameters; turn the parameters into a 4-dimensional array and let `numpy` broadcasting take care of the looping for you. This one command should be all you need, and it will be very, very fast:
```python
z4 = (
    (x ** alpha + (1 - delta) * x)[:, None, None, None]
    + y[None, None, :, None]
    - x[None, :, None, None]
    - y[None, None, None, :]
)
```
You may notice that everything to do with the `i` index in your code has been placed in the first dimension (by `[:, None, None, None]`); everything to do with `j` in the third (by `[None, None, :, None]`), and so on. The result has shape `[T, T, T, T]`, one dimension for each of the indices (`i`, `a`, `j` and `b` in your code, respectively). Because the dimensions are ordered appropriately, you can then collapse it to shape `[T * T, T * T]` with

```python
z = np.reshape(z4, [T * T, T * T])
```
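To see that this reshape lines up with the original `z[i*T+a, j*T+b]` indexing, here is a quick spot check (using random data as a stand-in for the real `z4`):

```python
import numpy as np

T = 50
rng = np.random.default_rng(0)
z4 = rng.standard_normal((T, T, T, T)).astype('float32')  # stand-in for the real z4
z = np.reshape(z4, [T * T, T * T])

# row index is i*T + a, column index is j*T + b, exactly as in the loop
i, a, j, b = 3, 7, 11, 42
assert z[i * T + a, j * T + b] == z4[i, a, j, b]
```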
There is a slight difference in the results because my calculation stays entirely in `float32` land, while yours wanders into `float64` and then gets forced into whatever `z.dtype` is (I can't see the definition of `z` in your code). This happens because `x ** alpha` and `(1 - delta) * x` keep `x.dtype`, being entirely inside `numpy`; but the scalar expressions `x[i] ** alpha` and `(1 - delta) * x[i]` are not so constrained. If you change your loop body to this, you can verify you are getting the same result:

```python
x[i]**np.float32(alpha) + np.float32(1 - delta) * x[i] + y[j] - x[a] - y[b]
```
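Putting the pieces together, a self-contained check (with a smaller `T` so the Python loop finishes quickly) shows the loop and the broadcast version agree:

```python
import numpy as np
from itertools import product

T = 20  # smaller grid so the Python loop finishes quickly
beta, alpha, delta = 0.98, 0.3, 0.1
xss = ((1 / beta - (1 - delta)) / alpha) ** (1 / (alpha - 1))
yss = xss ** alpha
x = np.linspace(0.8 * xss, 1.4 * xss, T, dtype='float32')
y = np.linspace(yss, 1.2 * yss, T, dtype='float32')

# original loop, with the scalars forced to float32 as suggested above
a32, d32 = np.float32(alpha), np.float32(1 - delta)
z_loop = np.empty((T * T, T * T), dtype='float32')
for i, j, a, b in product(range(T), range(T), range(T), range(T)):
    z_loop[i * T + a, j * T + b] = x[i] ** a32 + d32 * x[i] + y[j] - x[a] - y[b]

# broadcast version
z4 = (
    (x ** alpha + (1 - delta) * x)[:, None, None, None]
    + y[None, None, :, None]
    - x[None, :, None, None]
    - y[None, None, None, :]
)
z_fast = np.reshape(z4, [T * T, T * T])

assert np.allclose(z_loop, z_fast)
```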
That said, this speed-up is only possible because `x` and `y` are `numpy` arrays: `numpy` stores numbers very efficiently, does not need to allocate or deallocate anything during the calculation itself, and does all of this at a very low level, outside Pythonland. Given that you did not put numpy in your tags, this may be cheating a little; but you do use `numpy` in your code, so I figure it should be fine. If you are happy with a `numpy` answer, adding the numpy tag would be good.
In pure Python, the best you can do is pre-calculate what you can:

```python
xi = [vx**alpha + (1 - delta) * vx for vx in x]
```

then run your loop, replacing `x[i]**alpha + (1 - delta) * x[i]` with `xi[i]`. There is not much more that can be done to make it faster.
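For completeness, here is a sketch of that pure-Python version, with the grids built as plain lists (standing in for the `linspace` calls) and `z` as a list of lists:

```python
from itertools import product

T = 10
beta, alpha, delta = 0.98, 0.3, 0.1
xss = ((1 / beta - (1 - delta)) / alpha) ** (1 / (alpha - 1))
yss = xss ** alpha
# pure-Python equivalents of the np.linspace grids
x = [0.8 * xss + k * (0.6 * xss) / (T - 1) for k in range(T)]
y = [yss + k * (0.2 * yss) / (T - 1) for k in range(T)]

# pre-calculate the part that depends only on i
xi = [vx ** alpha + (1 - delta) * vx for vx in x]

z = [[0.0] * (T * T) for _ in range(T * T)]
for i, j, a, b in product(range(T), range(T), range(T), range(T)):
    z[i * T + a][j * T + b] = xi[i] + y[j] - x[a] - y[b]
```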