Question: How can I sort data by bins using groupby in pandas?
What I want is the following:
release_year listed_in
1920 Documentaries
1930 TV Shows
1940 TV Shows
1950 Classic Movies, Documentaries
1960 Documentaries
1970 Classic Movies, Documentaries
1980 Classic Movies, Documentaries
1990 Classic Movies, Documentaries
2000 Classic Movies, Documentaries
2010 Children & Family Movies, Classic Movies, Comedies
2020 Classic Movies, Dramas
To achieve this result I tried the following formula:
bins = [1925,1950,1960,1970,1990,2000,2010,2020]
groups = df.groupby(['listed_in', pd.cut(df.release_year, bins)])
groups.size().unstack()
It shows the following result:
release_year (1925,1950] (1950,1960] (1960,1970] (1970,1990] (1990,2000] (2000,2010] (2010, 2020]
listed_in
Action & Adventure 0 0 0 0 9 16 43
Action & Adventure, Anime Features, Children & Family Movies 0 0 0 0 0 0 1
Action & Adventure, Anime Features, Classic Movies 0 0 0 1 0 0 0
...
461 rows x 7 columns
I also tried the following formula:
df['release_year'] = df['release_year'].astype(str).str[0:2] + '0'
df.groupby('release_year')['listed_in'].apply(lambda x: x.mode().iloc[0])
The result was the following:
release_year
190 Dramas
200 Documentaries
Name: listed_in, dtype:object
Here is a sample of the dataset:
import pandas as pd
df = pd.DataFrame({
'show_id':['81145628','80117401','70234439'],
'type':['Movie','Movie','TV Show'],
'title':['Norm of the North: King Sized Adventure',
'Jandino: Whatever it Takes',
'Transformers Prime'],
'director':['Richard Finn, Tim Maltby',NaN,NaN],
'cast':['Alan Marriott, Andrew Toth, Brian Dobson',
'Jandino Asporaat','Peter Cullen, Sumalee Montano, Frank Welker'],
'country':['United States, India, South Korea, China',
'United Kingdom','United States'],
'date_added':['September 9, 2019',
'September 9, 2016',
'September 8, 2018'],
'release_year':['2019','2016','2013'],
'rating':['TV-PG','TV-MA','TV-Y7-FV'],
'duration':['90 min','94 min','1 Season'],
'listed_in':['Children & Family Movies, Comedies',
'Stand-Up Comedy','Kids TV'],
'description':['Before planning an awesome wedding for his',
'Jandino Asporaat riffs on the challenges of ra',
'With the help of three human allies, the Autob']})
The simplest way to do this is use the first part of your code and simply make the last digit of the release_year
a 0
. Then you can .groupby
decades and get the most popular genres for each decade i.e. the mode
:
input:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'show_id':['81145628','80117401','70234439'],
'type':['Movie','Movie','TV Show'],
'title':['Norm of the North: King Sized Adventure',
'Jandino: Whatever it Takes',
'Transformers Prime'],
'director':['Richard Finn, Tim Maltby',np.nan,np.nan],
'cast':['Alan Marriott, Andrew Toth, Brian Dobson',
'Jandino Asporaat','Peter Cullen, Sumalee Montano, Frank Welker'],
'country':['United States, India, South Korea, China',
'United Kingdom','United States'],
'date_added':['September 9, 2019',
'September 9, 2016',
'September 8, 2018'],
'release_year':['2019','2016','2013'],
'rating':['TV-PG','TV-MA','TV-Y7-FV'],
'duration':['90 min','94 min','1 Season'],
'listed_in':['Children & Family Movies, Comedies',
'Stand-Up Comedy','Kids TV'],
'description':['Before planning an awesome wedding for his',
'Jandino Asporaat riffs on the challenges of ra',
'With the help of three human allies, the Autob']})
code:
df['release_year'] = df['release_year'].astype(str).str[0:3] + '0'
df = df.groupby('release_year', as_index=False)['listed_in'].apply(lambda x: x.mode().iloc[0])
df
output:
release_year listed_in
0 2010 Children & Family Movies, Comedies