Operating on histogram bins in Python


I am trying to find the median of the values that fall within each bin generated by the np.histogram function. How would I select only the values within a given bin range and operate on those specific values? Below is an example of my data and what I am trying to do:

x = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]

Each y value has an x value associated with it, and that x value can fall anywhere in this range. For example, binning x gives:

hist, bins = np.histogram(x)
hist = [129, 126, 94, 133, 179, 206, 142, 147, 90, 185] 
bins = [0.,         0.09999926, 0.19999853, 0.29999779, 0.39999706,
        0.49999632, 0.59999559, 0.69999485, 0.79999412, 0.8999933,
        0.99999265]

So I want the median y value of the 129 points in the first bin, and likewise for each of the other bins.


Solution

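  • One way is with pandas.cut(). The example below takes the median of x itself within each bin; to get the median of y per bin, group pd.Series(y) instead of pd.Series(x):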

    >>> import pandas as pd
    >>> import numpy as np
    >>> np.random.seed(444)
    
    >>> x = np.random.randint(0, 25, size=100)
    >>> _, bins = np.histogram(x)
    >>> pd.Series(x).groupby(pd.cut(x, bins)).median()
    (0.0, 2.4]       2.0
    (2.4, 4.8]       3.0
    (4.8, 7.2]       6.0
    (7.2, 9.6]       8.5
    (9.6, 12.0]     10.5
    (12.0, 14.4]    13.0
    (14.4, 16.8]    15.5
    (16.8, 19.2]    18.0
    (19.2, 21.6]    20.5
    (21.6, 24.0]    23.0
    dtype: float64
    

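    Note that pd.cut() builds right-closed intervals by default, so values equal to the lowest edge (here, 0) are left out of every bin unless you pass include_lowest=True.

    If you want to stay in NumPy, check out np.digitize().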
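
The np.digitize() route could look roughly like the sketch below. The x and y arrays are made-up stand-ins, since the question's real data is not shown, and the names bin_idx and medians are only illustrative. Passing just the interior edges to np.digitize() is one way to make the bin assignment agree with np.histogram, whose last bin is closed on the right.

```python
import numpy as np

# Made-up data for illustration; the question's real x and y are not shown.
rng = np.random.default_rng(444)
x = rng.random(1000)
y = rng.normal(size=1000)

hist, bins = np.histogram(x)

# Digitize against the interior edges only, so indices run 0..len(hist)-1
# and the maximum x lands in the last bin, as it does in np.histogram.
bin_idx = np.digitize(x, bins[1:-1])

# Median of the y values whose x falls in each bin (NaN for an empty bin).
medians = np.array([
    np.median(y[bin_idx == i]) if hist[i] > 0 else np.nan
    for i in range(len(hist))
])
```

Here medians[i] lines up with hist[i] and with the interval from bins[i] to bins[i + 1], so the first entry is the median y of the points counted in the first bin.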