Search code examples
pythonnumpyscipyscientific-computing

binning data in python with scipy/numpy


is there a more efficient way to take an average of an array in prespecified bins? for example, i have an array of numbers and an array corresponding to bin start and end positions in that array, and I want to just take the mean in those bins? I have code that does it below but i am wondering how it can be cut down and improved. thanks.

from scipy import *
from numpy import *

def get_bin_mean(a, b_start, b_end):
    ind_upper = nonzero(a >= b_start)[0]
    a_upper = a[ind_upper]
    a_range = a_upper[nonzero(a_upper < b_end)[0]]
    mean_val = mean(a_range)
    return mean_val


data = rand(100)
bins = linspace(0, 1, 10)
binned_data = []

n = 0
for n in range(0, len(bins)-1):
    b_start = bins[n]
    b_end = bins[n+1]
    binned_data.append(get_bin_mean(data, b_start, b_end))

print binned_data

Solution

  • It's probably faster and easier to use numpy.digitize():

    import numpy
    data = numpy.random.random(100)
    bins = numpy.linspace(0, 1, 10)
    digitized = numpy.digitize(data, bins)
    bin_means = [data[digitized == i].mean() for i in range(1, len(bins))]
    

    An alternative to this is to use numpy.histogram():

    bin_means = (numpy.histogram(data, bins, weights=data)[0] /
                 numpy.histogram(data, bins)[0])
    

    Try for yourself which one is faster... :)