python, pandas, statistics, binning

How can I split my data in pandas into specified buckets, e.g. 40-40-20?


All,

I am trying to split my data into 3 buckets of 40%, 40% and 20%. How can I do this using pandas? That is, I want the bottom 40%, the middle 40% and the top 20%:

pd.cut(df['count'], 5, labels=['1', '2', '3', '4', '5'], retbins=True)

The above splits the data into 5 bins, but I would like a 40:40:20 percentile split instead.

Any ideas?


Solution

  • You are on the right path. From the wording in your question I'm not sure if you want to bin the data based on the range of possible values or the actual distribution of values. I'll show both.

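    Use pd.cut() to bin data based on the range of possible values: you supply fixed edges (here 0, 40, 80, 100), and each bin can hold any number of rows. Use pd.qcut() to bin data based on the actual distribution of values: you supply quantiles (here 0, 0.4, 0.8, 1), and the edges are computed so that each bin holds roughly that share of the rows.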

    import pandas as pd
    import numpy as np
    
    # sample data: 100 random integers from 0 to 99
    df = pd.DataFrame({'count': np.random.randint(0, 100, 100)})
    labels = ['Bottom 40%', 'Middle 40%', 'Top 20%']
    
    # bin data based on range of possible values
    # (include_lowest=True keeps a value of exactly 0 in the first bin)
    df['possible'] = pd.cut(df['count'], [0, 40, 80, 100], labels=labels, include_lowest=True)
    
    # bin data based on distribution of values
    df['distribution'] = pd.qcut(df['count'], [0., .4, .8, 1.], labels=labels)
    
    # select the rows that fall into the top bucket under each scheme
    top20possible = df.loc[df['possible'] == 'Top 20%']
    top20distribution = df.loc[df['distribution'] == 'Top 20%']
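
If you want to check that the quantile-based split really gives you 40/40/20, something like the following might help. It is only a rough sketch: the sample data (1,000 distinct values), the seed and the `bucket` column name are my own assumptions, not anything from your data. It also uses retbins=True, as in your original call, to get back the edges that pd.qcut() computed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the integers 0..999 in random order,
# so every value is distinct and the quantile edges cannot collide.
rng = np.random.default_rng(0)
df = pd.DataFrame({'count': rng.permutation(1000)})
labels = ['Bottom 40%', 'Middle 40%', 'Top 20%']

# retbins=True makes qcut return the computed bin edges as well.
df['bucket'], edges = pd.qcut(
    df['count'], [0, .4, .8, 1], labels=labels, retbins=True
)

# Fraction of rows that landed in each bucket, in bucket order.
shares = df['bucket'].value_counts(normalize=True).sort_index()
print(shares)
print(edges)
```

With this sample the shares come out as 0.4, 0.4 and 0.2. On real data with many repeated values the shares can drift from the targets, because tied values always end up in the same bucket.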