numpy, nested-loops

What is the most efficient way in Python to compute mean values within a grid cell?


I have a 2D array of data points (X) with corresponding observations (z), and I would like to compute the mean of z within each cell of a regular grid.

Using nested for loops with NumPy is inefficient: each iteration scans the whole array. Is there a faster way using a built-in function or a list comprehension? I would like to avoid Numba/JIT if possible.

import numpy

x = numpy.random.rand(1000000)
y = numpy.random.rand(1000000)
z = numpy.random.rand(1000000)

nx = 1000
ny = 1000
xl = numpy.linspace(0,1,nx+1)
yl = numpy.linspace(0,1,ny+1)
zm = numpy.full((nx,ny),numpy.nan)
for i in range(nx):
    for j in range(ny):
        zm[i,j] = numpy.mean(z, where = ((x>xl[i]) & (x<=xl[i+1]) & (y>yl[j]) & (y<=yl[j+1]))) #4.5 ms/loop = 75 minutes    

Or, the same computation with the points stored in a single 2D array:

import numpy

X = numpy.array([numpy.random.rand(1000000),numpy.random.rand(1000000)]).T
z = numpy.random.rand(1000000)

nx = 1000
ny = 1000
xl = numpy.linspace(0,1,nx+1)
yl = numpy.linspace(0,1,ny+1)
zm = numpy.full((nx,ny),numpy.nan)
for i in range(nx):
    print(i)
    for j in range(ny):
        zm[i,j] = numpy.mean(z, where = ((X[:,0]>xl[i]) & (X[:,0]<=xl[i+1]) & (X[:,1]>yl[j]) & (X[:,1]<=yl[j+1]))) #4.5 ms/loop = 75 minutes

Solution

  • In pandas you could use:

    import pandas as pd
    
    df = pd.DataFrame({'x': pd.cut(x, xl, labels=range(nx)),
                       'y': pd.cut(y, yl, labels=range(ny)),
                       'z': z})
    
    out = (df.groupby(['x', 'y'])['z'].mean().unstack()
             .reindex(index=range(nx), columns=range(ny))
             .to_numpy()
          )
    

    Running time:

    3.95 s ± 1.83 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
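
If you would rather stay in NumPy only, one possible sketch is to compute a bin index per point and let numpy.bincount accumulate per-cell sums and counts. This is an alternative to the pandas approach above, not a rewrite of it. The function name grid_mean is hypothetical, and it assumes the bin edges are sorted in increasing order. Cells use the same (lower, upper] convention as the loops in the question, and empty cells come out as NaN.

```python
import numpy as np

def grid_mean(x, y, z, xl, yl):
    """Mean of z in each (xl[i], xl[i+1]] x (yl[j], yl[j+1]] cell (hypothetical helper)."""
    nx, ny = len(xl) - 1, len(yl) - 1
    # right=True returns i with xl[i-1] < x <= xl[i], so the cell index is i - 1
    ix = np.digitize(x, xl, right=True) - 1
    iy = np.digitize(y, yl, right=True) - 1
    # Drop points outside the grid (including points exactly on the lowest edge)
    inside = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    flat = ix[inside] * ny + iy[inside]  # row-major flattened cell index
    sums = np.bincount(flat, weights=z[inside], minlength=nx * ny)
    counts = np.bincount(flat, minlength=nx * ny)
    # 0/0 in empty cells yields NaN; silence the resulting warning
    with np.errstate(invalid="ignore", divide="ignore"):
        zm = sums / counts
    return zm.reshape(nx, ny)

# Same setup as in the question
x = np.random.rand(1000000)
y = np.random.rand(1000000)
z = np.random.rand(1000000)
nx = ny = 1000
xl = np.linspace(0, 1, nx + 1)
yl = np.linspace(0, 1, ny + 1)
zm = grid_mean(x, y, z, xl, yl)
```

Here zm[i, j] should correspond to the value the nested loops assign to zm[i, j], but in a single pass over the data rather than one full scan per cell. No timing is claimed; it is worth measuring against the pandas version on your own machine.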