Search code examples
pythonpandaspandas-groupby

Python - Group / count a list of numbers by custom ranges


Good evening,

I hope that you are well.

I have a list of numbers which I would like to group into "bins" or "buckets" based on ranges that I define. I would like the output to display each group "name" and the total number of values that fall within that range.

For example :-

my_list = [1, 123123, 12, 982023, 24, 446, 903, 2004]

Example criteria

  • greater than 250,000 (output: 1)
  • greater than 100,000 but less than or equal to 250,000 (output: 1)
  • greater than 10,000 but less than or equal to 100,000 (output: 0)
  • greater than 1,000 but less than or equal to 10,000 (output: 1)
  • greater than 100 but less than or equal to 1000 (output: 2)
  • less than 100 (output: 3)

I could obviously achieve this in a very crude way by writing multiple conditional if statements but I am aware that there must be a more elegant way of achieving the result.

Various searches have indicate that I could possibly achieve this using pandas.cut / digitize however as of yet, I have been unsuccessful in achieving the required output.

Any assistance would be much appreciated.

Many thanks

James


Solution

  • you are right, you can use pd.cut combined with a groupby to achieve what you want.

    Step 1: Define data

    import pandas as pd
    import numpy as np
    
    my_list = [1, 123123, 12, 982023, 24, 446, 903, 2004]
    df = pd.DataFrame(my_list, columns=['data'])
    cut_edges = np.array([-np.inf, 100, 1000, 10000, 100000, 250000, np.inf])
    labels = ['less than 100', 'between 100 and 1,000', 'between 1,000 and 10,000', 'between 10,000 and 100,000', 'between 100,000 and 250,000', 'greater than 250,000']
    

    Step 2: Generate the category name using pd.cut, and set index for groupby later

    df['category'] = pd.cut(df['data'], cut_edges, labels=labels)
    df.set_index('category', append=False, inplace=True)
    

    Step 3: groupby to do the count

    df.groupby(level='category').count()
    

    Output:

    enter image description here

    EDIT

    As pointed out in the comments numpy.histogram is another possibly more concise approach which will work. This answer used pd.cut as it was specifically mentioned in the question.