Search code examples
pythonpandasdataframecountbin

python pandas dataframe create bins only for data in threshold


In a python pandas dataframe "df", I have the following three columns:

song_id | user_id | play_count

play_count = how many times a user listened to a song

I am trying to add a column "rating" to this table based on play count. For example, if play_count =2, the rating will be low like "1".

First, I need to establish the rating threshold for my 1-10 rating system.

df.play_count.describe()
count    393727.000000
mean          2.567627
std           4.822111
min           1.000000
25%           1.000000
50%           1.000000
75%           2.000000
max         771.000000
Name: play_count, dtype: float64

Most of the play_counts are between 1 and 200:

pd.value_counts(pd.cut(df.play_count, bins = 10))
(0.23, 78]    393576
(78, 155]        129
(155, 232]        13
(232, 309]         6
(309, 386]         2
(694, 771]         1
(617, 694]         0
(540, 617]         0
(463, 540]         0
(386, 463]         0
dtype: int64

I would like to create 10 buckets, with the last bucket being that if the play_count is above 200, the song has a rating of "10". So I need to establish the thresholds of the other 9 buckets.

Unfortunately, this does not work:

pd.value_counts(pd.cut(df[['play_count'] < 200]], bins = 9))
f = df[df['play_count'] < 200].hist()

Solution

  • # get threshholds for first 9 bins
    _, bins = pd.cut(df[df.play_count < 200].play_count, bins=9,retbins=True)
    
    # append threshhold representing class with play_counts > 200
    new_bins = pd.np.append(bins,float(max(df.play_count)))
    
    # our categorized data
    out = pd.cut(df.play_count,bins=new_bins)
    
    # a histogram of the data with the updated bins
    df.play_count.hist(bins=new_bins)