
Python: bin data and return the bin midpoint (maybe using pandas.cut and qcut)


Can I make the pandas cut/qcut functions return the bin endpoint or bin midpoint instead of a string bin label?

Currently

pd.cut(pd.Series(np.arange(11)), bins = 5)

0     (-0.01, 2]
1     (-0.01, 2]
2     (-0.01, 2]
3         (2, 4]
4         (2, 4]
5         (4, 6]
6         (4, 6]
7         (6, 8]
8         (6, 8]
9        (8, 10]
10       (8, 10]
dtype: category

which gives category / string values. What I want is

0     1.0
1     1.0
2     1.0
3     3.0
4     3.0

with numerical values representing an edge or the midpoint of each bin.


Solution

  • I see that this is an old post but I will take the liberty to answer it anyway.

    It is now possible (ref @chrisb's answer) to access the endpoints of categorical intervals through the left and right attributes of each interval.

    s = pd.cut(pd.Series(np.arange(11)), bins = 5)
    
    mid = [(a.left + a.right)/2 for a in s]
    Out[34]: [0.995, 0.995, 0.995, 3.0, 3.0, 5.0, 5.0, 7.0, 7.0, 9.0, 9.0]
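
The same midpoints can also be computed without a Python-level loop. The following is a minimal sketch, assuming a pandas version in which the categories of the `pd.cut` result are an `IntervalIndex` (which exposes `mid`, `left` and `right`); the names `bin_mids` and `mid` are only illustrative.

```python
import numpy as np
import pandas as pd

s = pd.cut(pd.Series(np.arange(11)), bins=5)

# s.cat.categories is an IntervalIndex holding the five distinct bins;
# .mid gives one midpoint per bin.
bin_mids = s.cat.categories.mid.to_numpy()

# s.cat.codes holds the integer position of each row's bin. A code of -1
# would mark a missing value; none occur here, so plain indexing is safe.
mid = pd.Series(bin_mids[s.cat.codes.to_numpy()], index=s.index)
print(mid.tolist())
```

As with the list comprehension, the first bin still reports roughly 0.995 because its left edge is -0.01 rather than 0.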
    

    Since the intervals are open on the left and closed on the right, pandas extends the range by 0.1% so that the minimum value is included. The 'first' interval (the one that should start at 0) therefore actually starts at -0.01. To get a midpoint using 0 as the left value, you can do this:

    mid_alt = [(a.left + a.right)/2 if a.left != -0.01 else a.right/2 for a in s]
    Out[35]: [1.0, 1.0, 1.0, 3.0, 3.0, 5.0, 5.0, 7.0, 7.0, 9.0, 9.0]
    

    Alternatively, you can make the intervals closed on the left and open on the right by passing right=False:

    t = pd.cut(pd.Series(np.arange(11)), bins = 5, right=False)
    Out[38]: 
    0       [0.0, 2.0)
    1       [0.0, 2.0)
    2       [2.0, 4.0)
    3       [2.0, 4.0)
    4       [4.0, 6.0)
    5       [4.0, 6.0)
    6       [6.0, 8.0)
    7       [6.0, 8.0)
    8     [8.0, 10.01)
    9     [8.0, 10.01)
    10    [8.0, 10.01)
    

    But, as you can see, the same problem then appears at the last interval, whose right edge is extended to 10.01.
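
One way to sidestep the extended edge on either side is to pass explicit bin edges and ask `pd.cut` for integer bin indices instead of labels. This is only a sketch under the assumption that equal-width bins spanning the minimum to the maximum are wanted; `edges`, `mids` and `codes` are illustrative names, not part of the original answer.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(11))

# Explicit edges 0, 2, 4, 6, 8, 10 (assumed: five equal-width bins over [min, max]).
edges = np.linspace(x.min(), x.max(), 6)
mids = (edges[:-1] + edges[1:]) / 2  # midpoints 1, 3, 5, 7, 9

# labels=False returns the integer bin index of each value;
# include_lowest=True places the minimum value in the first bin.
codes = pd.cut(x, bins=edges, labels=False, include_lowest=True)

# Look up each row's midpoint by its bin index (no missing values here).
mid = pd.Series(mids[codes.to_numpy()], index=x.index)
print(mid.tolist())
# [1.0, 1.0, 1.0, 3.0, 3.0, 5.0, 5.0, 7.0, 7.0, 9.0, 9.0]
```

Because the edges are supplied directly, pandas does not widen them, so the midpoints come out exact without special-casing the first or last bin. Replacing `mids` with `edges[:-1]` or `edges[1:]` would return the left or right edge of each bin instead.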