I have a 2D array of data points (X) with corresponding observations (z) and would like to compute, for each cell of a grid, the mean of the z values falling in that cell.
My nested-for-loop approach with NumPy is far too slow. Is there a faster way, e.g. a built-in function or a list comprehension? I would like to avoid Numba/JIT if possible.
import numpy
x = numpy.random.rand(1000000)
y = numpy.random.rand(1000000)
z = numpy.random.rand(1000000)
nx = 1000
ny = 1000
xl = numpy.linspace(0,1,nx+1)
yl = numpy.linspace(0,1,ny+1)
zm = numpy.full((nx,ny),numpy.nan)
for i in range(nx):
    for j in range(ny):
        zm[i,j] = numpy.mean(z, where = ((x>xl[i]) & (x<=xl[i+1]) & (y>yl[j]) & (y<=yl[j+1]))) # 4.5 ms/loop = 75 minutes
Or, 2D variant:
import numpy
X = numpy.array([numpy.random.rand(1000000),numpy.random.rand(1000000)]).T
z = numpy.random.rand(1000000)
nx = 1000
ny = 1000
xl = numpy.linspace(0,1,nx+1)
yl = numpy.linspace(0,1,ny+1)
zm = numpy.full((nx,ny),numpy.nan)
for i in range(nx):
    print(i)  # progress indicator
    for j in range(ny):
        zm[i,j] = numpy.mean(z, where = ((X[:,0]>xl[i]) & (X[:,0]<=xl[i+1]) & (X[:,1]>yl[j]) & (X[:,1]<=yl[j+1]))) # 4.5 ms/loop = 75 minutes
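For comparison, the per-cell means can also be computed without nested loops in pure NumPy: map each point to a flat cell index, then use two `numpy.bincount` calls (one weighted by z for the sums, one plain for the counts). This is a sketch assuming the same (lo, hi] bin convention as the loops above; the only difference is that points exactly at 0 land in the first bin, which is negligible for continuous data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x, y, z = rng.random(n), rng.random(n), rng.random(n)
nx = ny = 1000
xl = np.linspace(0, 1, nx + 1)
yl = np.linspace(0, 1, ny + 1)

# digitize with right=True reproduces the (lo, hi] bins of the loops;
# passing only the interior edges yields indices 0..nx-1 directly.
ix = np.digitize(x, xl[1:-1], right=True)
iy = np.digitize(y, yl[1:-1], right=True)
flat = ix * ny + iy  # one linear cell index per point

# Two bincount passes replace the million mean() calls.
sums = np.bincount(flat, weights=z, minlength=nx * ny)
counts = np.bincount(flat, minlength=nx * ny)
with np.errstate(invalid='ignore'):
    zm = (sums / counts).reshape(nx, ny)  # NaN where a cell is empty
```

Empty cells come out as NaN, matching the `numpy.mean(..., where=...)` version, and the whole computation should take a small fraction of the looped version's running time.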
In pandas you could use:
import pandas as pd
df = pd.DataFrame({'x': pd.cut(x, xl, labels=range(nx)),
                   'y': pd.cut(y, yl, labels=range(ny)),
                   'z': z})

out = (df.groupby(['x', 'y'])['z'].mean().unstack()
         .reindex(index=range(nx), columns=range(ny))
         .to_numpy()
      )
Running time:
3.95 s ± 1.83 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
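If SciPy is an option, `scipy.stats.binned_statistic_2d` does the binning and the averaging in one call. Note that its bins are closed on the left and open on the right (only the last bin includes its right edge), the opposite of the (lo, hi] convention in the question; for continuous data the difference is negligible. A sketch at reduced size:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100_000
x, y, z = rng.random(n), rng.random(n), rng.random(n)
nx = ny = 100
xl = np.linspace(0, 1, nx + 1)
yl = np.linspace(0, 1, ny + 1)

# statistic='mean' takes a fast vectorized path; empty bins come back as NaN.
res = stats.binned_statistic_2d(x, y, z, statistic='mean', bins=[xl, yl])
zm = res.statistic  # shape (nx, ny): mean of z per grid cell
```

The result has the same orientation as the `zm` array in the question (first axis indexes the x bins), so it can be compared directly against the loop or pandas outputs.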