
pandas cut function's output is not understandable


I want to create a frequency table with pandas. I tried the code below, but I don't understand the output, and I also want to change the class range.

data:

Rank
1      18.42
2      20.93
3      31.50
4      23.99
5       5.65
       ...  
96     11.40
97      5.14
98     15.30
99     12.99
100     5.28
Name: book price, Length: 100, dtype: float64

code:

import math
import numpy as np
import pandas as pd
cn = math.sqrt(len(top_books['book price']))

cr = (
    max(top_books['book price'])-min(top_books['book price'])
    )//cn

print("class number:",cn
      ,"\nclass range:",cr)

data = np.sort(top_books["book price"].values)
pd.cut(x=data,bins=int(cn))

output:

class number: 10.0 
class range: 4.0
[(2.734, 7.379], (2.734, 7.379], (2.734, 7.379], (2.734, 7.379], (2.734, 7.379], ..., (25.775, 30.374], (25.775, 30.374], (30.374, 34.973], (44.171, 48.77], (44.171, 48.77]]
Length: 100
Categories (10, interval[float64, right]): [(2.734, 7.379] < (7.379, 11.978] < (11.978, 16.577] < (16.577, 21.176] ... (30.374, 34.973] < (34.973, 39.572] < (39.572, 44.171] < (44.171, 48.77]]

expected output:

class number: 10.0 
class range: 4.0
[(2, 6] < (7, 11] < (12, 16] < (17, 21] ... (30, 34] < (35, 39] < (40, 44] < (45, 49]]

I don't understand the output of pd.cut. Why do the same ranges repeat in the first part, and how can I change the class range? I want (2, 6], (7, 11], ...

Goal: I want to draw a histogram like the one in this picture.

I cannot find any parameter for the class range.


Solution

  • That output shows that you've successfully created the bins. The first part lists the bin assigned to each of the 100 values, which is why the same interval repeats, and the Categories line lists the 10 distinct bins. Now you just need to use those bins to group your original dataframe. You can then apply an aggregate function like count to calculate the number of entries in each range.

    It also seems that you want the bin edges to be integers rather than floats, so you should create an IntervalIndex object using pd.interval_range. Here's an example of how you might do both of these things:

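    import math

    import pandas as pd

    # top_books is the asker's DataFrame with a 'book price' column
    prices = top_books['book price']

    # number of classes: square root of the number of observations
    cn = math.sqrt(len(prices))

    # integer class width: span between the integers enclosing the data,
    # divided by the number of classes and rounded up
    cr = math.ceil((math.ceil(prices.max()) - int(prices.min())) / cn)

    print("class number:", cn, "\nclass range:", cr)

    # right-closed bins with integer edges, starting at the integer below the minimum
    interval = pd.interval_range(start=int(prices.min()), periods=int(cn), freq=cr)

    # bin the prices in their original row order so the labels stay aligned
    # with the rows of top_books (binning a sorted copy would misalign them)
    bins = pd.cut(x=prices, bins=interval)
    binned_data = top_books.groupby(bins, observed=False).agg(['count'])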
    

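    The cr value has to be computed differently to create the ranges you want. Take the closest integer at or below min(top_books['book price']) and the closest integer at or above max(top_books['book price']), divide the difference between them by the number of bins you want, and round that value up. That is your proper "cr". Note that the intervals produced by pd.interval_range are contiguous, so you get ranges such as (2, 7], (7, 12], ... rather than ranges with gaps like (2, 6], (7, 11].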
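
As a self-contained illustration of the steps above, here is a minimal sketch. It assumes made-up prices, evenly spaced between 2.78 and 48.77, in place of top_books['book price']; the names prices, n_bins, low, high, width and freq_table are chosen for this example only. It builds the integer-edged bins and counts the entries per class with value_counts:

```python
import math

import numpy as np
import pandas as pd

# Made-up stand-in for top_books['book price'] (assumption: 100 values
# evenly spaced over roughly the same range as in the question).
prices = pd.Series(np.linspace(2.78, 48.77, 100).round(2), name="book price")

n_bins = int(math.sqrt(len(prices)))       # number of classes
low = math.floor(prices.min())             # integer at or below the minimum
high = math.ceil(prices.max())             # integer at or above the maximum
width = math.ceil((high - low) / n_bins)   # integer class width, rounded up

# Contiguous, right-closed intervals with integer edges.
intervals = pd.interval_range(start=low, periods=n_bins, freq=width)

# Assign each price to its interval, keeping the original row order.
binned = pd.cut(prices, bins=intervals)

# Frequency table: one row per class, in interval order (empty classes kept).
freq_table = binned.value_counts(sort=False)
print(freq_table)
```

Because the intervals are closed on the right, a price exactly equal to low would fall outside the first interval. If that can happen in your data, passing closed='left' to pd.interval_range is one way to handle it, at the cost of excluding a value exactly equal to the last upper edge.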