Tags: python, arrays, numpy, scipy, scientific-computing

How to calculate mean of each bin after using `numpy.digitize` to split up a NumPy array?


I have an input array that is split up into bins, and I want to calculate the mean of the values in each bin. Take the following example:

>>> import numpy as np
>>> a = np.array([1.4, 2.6, 0.7, 1.1])

This array is split up into bins by np.digitize:

>>> bins = np.arange(0, 2 + 1)
>>> indices = np.digitize(a, bins)
>>> indices
array([2, 3, 1, 2])

This does exactly what I expect, as the following shows more explicitly:

>>> for i in range(len(bins)):
...     f"bin where {i} <= x < {i + 1} contains {a[indices == i + 1]}"
... 
'bin where 0 <= x < 1 contains [0.7]'
'bin where 1 <= x < 2 contains [1.4 1.1]'
'bin where 2 <= x < 3 contains [2.6]'

However, now I want the mean of each bin. The non-NumPy way, looping over the bins (here with a list comprehension), would look like this:

>>> b = np.array([a[indices == i + 1].mean() for i in range(len(bins))])
>>> b
array([0.7 , 1.25, 2.6 ])

But looping like this seems neither elegant (Pythonic) nor efficient, since the resulting list then has to be converted into a NumPy array with np.array.

What's the NumPy way to do this?


Solution

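  • IIUC, this is a job for np.bincount: called with a as weights it sums the values in each bin, called without weights it counts them, and the ratio of the two is the per-bin mean: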

    np.bincount(indices - 1, weights=a) / np.bincount(indices - 1)
    

    Output:

    array([0.7 , 1.25, 2.6 ])
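
The one-liner above works for the example because every bin holds at least one value and nothing falls below bins[0]. If that is not guaranteed, here is a rough sketch of a more defensive variant. The helper name bin_means is made up for illustration and is not part of NumPy. It assumes you want NaN for empty bins, and it drops values below bins[0] instead of shifting the indices by one, since np.bincount rejects negative input.

```python
import numpy as np

def bin_means(a, bins):
    """Per-bin mean of `a` for bins as produced by np.digitize (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    indices = np.digitize(a, bins)      # values range from 0 to len(bins)
    n_slots = len(bins) + 1             # slot 0 collects values below bins[0]

    # Weighted bincount gives the sum per slot, plain bincount gives the count.
    sums = np.bincount(indices, weights=a, minlength=n_slots)
    counts = np.bincount(indices, minlength=n_slots)

    # Divide only where a slot is non-empty; empty slots stay NaN.
    means = np.full(n_slots, np.nan)
    np.divide(sums, counts, out=means, where=counts > 0)

    return means[1:]                    # drop the "below bins[0]" slot

a = np.array([1.4, 2.6, 0.7, 1.1])
bins = np.arange(0, 2 + 1)
print(bin_means(a, bins))               # means of the three bins from the question

# An input that leaves the middle bin empty: that entry comes back as NaN.
print(bin_means(np.array([0.5, 2.5]), bins))
```

Passing minlength keeps the result the same length no matter which bins actually received values, so the output lines up with bins even when the last bins are empty.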