Split an array into data based on bins returned by numpy histogram


I have an array x with data like this: [3.1, 3.0, 3.3, 3.5, 3.8, 3.75, 4.0] etc. I have another variable y with corresponding 0s and 1s, such as [0, 1, 0]. I want to split both into separate arrays, one per histogram bin.

freq, bins = np.histogram(X, 5)

That gives me the cutoffs for each bin. But how do I actually get the data that falls into each one? For example, if I have two bins (3 to 3.5 and 3.5 to 4), I want to get two arrays in return like this: [3.1, 3.2, 3.4, ...] and [3.6, 3.7, 4, ...]. I also want the variable y to be split and grouped in the same way.

Summary: I am looking for code to break x into bins with corresponding y values.

I thought about doing something using the bins variable, but I am not sure how to split the data based on the cutoffs. I appreciate any help.

I can plot a normal histogram of X (image omitted) using this code:

d=plt.hist(X, 5, facecolor='blue', alpha=0.5)

Working Code:

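from itertools import tee

import numpy as np


def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)


def getLists(a, b, bin_obj):
    # Collect, for each bin, the indices of the values in [left, right).
    # Note: the half-open test leaves out values equal to the last edge,
    # which here is the maximum of the data.
    index_list = []
    for left, right in pairwise(bin_obj):
        indices = np.where((a >= left) & (a < right))
        index_list += [indices[0]]
    # Apply the same indices to both arrays so they stay aligned.
    X_ret = [a[i] for i in index_list]
    Y_ret = [b[i] for i in index_list]
    return (X_ret, Y_ret)


freq, bins = np.histogram(X[:, 0], 5)

Xnew, Ynew = getLists(X[:, 0], Y, bins)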

Solution

  • There's a handy Python function, pairwise, given as a recipe in the itertools documentation (and available directly as itertools.pairwise since Python 3.10).

    from itertools import tee
    
    def pairwise(iterable):
        "s -> (s0,s1), (s1,s2), (s2, s3), ..."
        a, b = tee(iterable)
        next(b, None)
        return zip(a, b)
    

    It lets you iterate over consecutive pairs of bin edges and get the indices of the elements that fall in each bin.

    for left, right in pairwise(bins):
        indices = np.where((x >= left) & (x < right))
        print(x[indices], y[indices])
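
One caveat with the loop above: the test x < right is half-open, so a value equal to the last bin edge (the maximum of the data) is not assigned to any bin, whereas np.histogram counts it in the last bin. A possible alternative, sketched here with np.digitize instead of pairwise, is to pass only the interior edges so every value gets an index from 0 to n_bins - 1. The function name split_by_bins and the y values are made up for illustration; only x comes from the question.

```python
import numpy as np


def split_by_bins(x, y, bins):
    """Group x and y by the bin edges returned by np.histogram (hypothetical helper)."""
    x = np.asarray(x)
    y = np.asarray(y)
    n_bins = len(bins) - 1
    # With only the interior edges, np.digitize returns 0..n_bins-1,
    # so a value equal to the last edge falls into the last bin.
    idx = np.digitize(x, bins[1:-1])
    # Use the same mask for both arrays so they stay aligned.
    x_groups = [x[idx == k] for k in range(n_bins)]
    y_groups = [y[idx == k] for k in range(n_bins)]
    return x_groups, y_groups


x = np.array([3.1, 3.0, 3.3, 3.5, 3.8, 3.75, 4.0])
y = np.array([0, 1, 0, 1, 1, 0, 1])  # made-up labels, not from the question

freq, bins = np.histogram(x, 5)
x_groups, y_groups = split_by_bins(x, y, bins)

for xs, ys in zip(x_groups, y_groups):
    print(xs, ys)
```

The group sizes should then line up with freq. Note that this sketch also places any value outside the range of bins into the first or last group, which does not matter when the bins come from a histogram of the same data.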