
numpy binned mean, conserving extra axes


It seems I am stuck on the following problem with numpy.

I have an array X with shape X.shape = (nexp, ntime, ndim, npart). I need to compute binned statistics on this array along the npart dimension, according to the values in binvals (and some bins), while keeping all the other dimensions, because I have to use the binned statistic to remove some bias from the original array X. The binning values have shape binvals.shape = (nexp, ntime, npart).

A complete, minimal example to explain what I am trying to do. Note that, in reality, I am working on large arrays with several hundred bins, so this implementation takes forever:

import numpy as np

np.random.seed(12345)

X = np.random.randn(24).reshape(1, 2, 3, 4)
binvals = np.random.randn(8).reshape(1, 2, 4)
bins = [-np.inf, 0, np.inf]
nexp, ntime, ndim, npart = X.shape

cleanX = np.zeros_like(X)
for ne in range(nexp):
    for nt in range(ntime):
        # assign each particle to a bin (np.digitize returns 1-based indices)
        indices = np.digitize(binvals[ne, nt, :], bins)
        for nd in range(ndim):
            for nb in range(1, len(bins)):
                # subtract the mean of bin nb from the particles in that bin
                inds = indices == nb
                cleanX[ne, nt, nd, inds] = X[ne, nt, nd, inds] - \
                    np.mean(X[ne, nt, nd, inds], axis=-1)

Looking at the results may make this clearer:

In [8]: X
Out[8]: 
array([[[[-0.20470766,  0.47894334, -0.51943872, -0.5557303 ],
         [ 1.96578057,  1.39340583,  0.09290788,  0.28174615],
         [ 0.76902257,  1.24643474,  1.00718936, -1.29622111]],

        [[ 0.27499163,  0.22891288,  1.35291684,  0.88642934],
         [-2.00163731, -0.37184254,  1.66902531, -0.43856974],
         [-0.53974145,  0.47698501,  3.24894392, -1.02122752]]]])

In [10]: cleanX
Out[10]: 
array([[[[ 0.        ,  0.67768523, -0.32069682, -0.35698841],
         [ 0.        ,  0.80405255, -0.49644541, -0.30760713],
         [ 0.        ,  0.92730041,  0.68805503, -1.61535544]],

        [[ 0.02303938, -0.02303938,  0.23324375, -0.23324375],
         [-0.81489739,  0.81489739,  1.05379752, -1.05379752],
         [-0.50836323,  0.50836323,  2.13508572, -2.13508572]]]])


In [12]: binvals
Out[12]: 
array([[[ -5.77087303e-01,   1.24121276e-01,   3.02613562e-01,
           5.23772068e-01],
        [  9.40277775e-04,   1.34380979e+00,  -7.13543985e-01,
          -8.31153539e-01]]])

Is there a vectorized solution? I thought of using scipy.stats.binned_statistic, but I can't work out how to apply it to this problem. Thanks!
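
For reference, scipy.stats.binned_statistic takes 1-D sample positions plus one or more value arrays of the same length, so it can handle the ndim axis in a single call but still needs the loops over nexp and ntime. A minimal sketch of that route, reusing X, binvals, and bins from the example above (note that an empty bin would produce a NaN mean):

from scipy.stats import binned_statistic

cleanX2 = np.zeros_like(X)
for ne in range(nexp):
    for nt in range(ntime):
        # 'means' has shape (ndim, nbins); 'binnumber' is 1-based, like np.digitize
        means, _, binnumber = binned_statistic(
            binvals[ne, nt], X[ne, nt], statistic='mean', bins=bins)
        # subtract each particle's own bin mean from all ndim components
        cleanX2[ne, nt] = X[ne, nt] - means[:, binnumber - 1]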


Solution

  • Ok, I think I got it, mainly based on the answer by @jdehesa.

    clean2 = np.zeros_like(X)
    d = np.digitize(binvals, bins)
    for i in range(1, len(bins)):
        m = d == i                # particles in bin i, shape (nexp, ntime, npart)
        minds = np.where(m)
        # index tuple selecting those particles across the full ndim axis
        # (a tuple, not a list, so it works as an index in current numpy)
        sl = (*minds[:2], slice(None), minds[2])
        msum = m.sum(axis=-1)     # bin occupancy for each (nexp, ntime)
        # per-bin mean over npart, broadcast back, subtract, write only bin i
        clean2[sl] = (X -
                      (np.sum(X * m[..., np.newaxis, :], axis=-1) /
                       msum[..., np.newaxis])[..., np.newaxis])[sl]
    

    This gives the same results as my original code. On the small arrays in this example, it is approximately three times as fast as the original; I expect the gain to be much larger on big arrays.
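
    A quick sanity check that the two implementations agree on the example data:

    # both versions should match to floating-point precision
    assert np.allclose(cleanX, clean2)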

    Update:

    Indeed it's faster on larger arrays (I didn't run any formal benchmark), but even so it only just reaches an acceptable level of performance... any further suggestions on extra vectorizations would be very welcome.
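
    One further idea, sketched under the assumption that an (nexp, ntime, npart, nbins) array fits in memory: one-hot encode the bin assignments and compute all per-bin means with einsum, removing the Python loop over bins entirely.

    # 0-based bin index per particle, shape (nexp, ntime, npart)
    d0 = np.digitize(binvals, bins) - 1
    nbins = len(bins) - 1
    # one-hot bin membership, shape (nexp, ntime, npart, nbins)
    onehot = (d0[..., np.newaxis] == np.arange(nbins)).astype(X.dtype)
    # per-bin sums over npart, keeping the other axes: (nexp, ntime, ndim, nbins)
    sums = np.einsum('etdp,etpb->etdb', X, onehot)
    # bin occupancies; clamping to 1 is safe because empty bins have zero sums
    counts = np.maximum(onehot.sum(axis=-2), 1)
    means = sums / counts[..., np.newaxis, :]
    # gather each particle's own bin mean via the same one-hot, then subtract
    clean3 = X - np.einsum('etdb,etpb->etdp', means, onehot)

    The result should match cleanX (np.allclose(clean3, cleanX)). This trades memory for speed, so with several hundred bins and large npart the one-hot array may be prohibitive; worth profiling against clean2 on realistic sizes before committing to it.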