Search code examples
pythonpandasgroup-bykaggle

How can I calculate percentage of a groupby column and sort it by descending order?


Question: How can I calculate percentage of a groupby column and sort it by descending order ?

Desired output:

country            count     percentage
United States      2555        45%
India               923        12%
United Kingdom      397        4%
Japan               226        3%
South Korea         183        2% 

I did some research, looked at the Pandas Documentation, looked at other questions here on Stackoverflow without luck.

I tried the following:

#1 Try:

Df2 = df.groupby('country')['show_id'].count().nlargest()
df3 = df2.groupby(level=0).apply(lambda x: x/x.sum() * 100)

Output:

director
A. L. Vijay            100.0
A. Raajdheep           100.0
A. Salaam              100.0
A.R. Murugadoss        100.0
Aadish Keluskar        100.0
...
Çagan Irmak            100.0
Ísold Uggadóttir       100.0
Óskar Thór Axelsson    100.0
Ömer Faruk Sorak       100.0
Şenol Sönmez           100.0

Name: show_id, Length: 4049, dtype: float64

#2 Try:

df2 = df.groupby('country')['show_id'].count()
df2['percentage'] = df2['show_id']/6000

Output:

KeyError: 'show_id'

Sample of the dataset:

import pandas as pd
df = pd.DataFrame({
'show_id':['81145628','80117401','70234439'],
'type':['Movie','Movie','TV Show'],
'title':['Norm of the North: King Sized Adventure',
'Jandino: Whatever it Takes',
'Transformers Prime'],
'director':['Richard Finn, Tim Maltby',NaN,NaN],
'cast':['Alan Marriott, Andrew Toth, Brian Dobson',
'Jandino Asporaat','Peter Cullen, Sumalee Montano, Frank Welker'], 
'country':['United States, India, South Korea, China',
'United Kingdom','United States'], 
'date_added':['September 9, 2019',
'September 9, 2016',
'September 8, 2018'],
'release_year':['2019','2016','2013'],
'rating':['TV-PG','TV-MA','TV-Y7-FV'],
'duration':['90 min','94 min','1 Season'],
'listed_in':['Children & Family Movies, Comedies',
'Stand-Up Comedy','Kids TV'],
'description':['Before planning an awesome wedding for his',
'Jandino Asporaat riffs on the challenges of ra',
'With the help of three human allies, the Autob']})

Solution

  • This doesn't address rows where there are multiple countries in the "country" field, but the lines below should work for the other parts of the question:

    Create initial dataframe:

    df = pd.DataFrame({
    'show_id':['81145628','80117401','70234439'],
    'type':['Movie','Movie','TV Show'],
    'title':['Norm of the North: King Sized Adventure',
    'Jandino: Whatever it Takes',
    'Transformers Prime'],
    'director':['Richard Finn, Tim Maltby',0,0],
    'cast':['Alan Marriott, Andrew Toth, Brian Dobson',
    'Jandino Asporaat','Peter Cullen, Sumalee Montano, Frank Welker'], 
    'country':['United States, India, South Korea, China',
    'United Kingdom','United States'], 
    'date_added':['September 9, 2019',
    'September 9, 2016',
    'September 8, 2018'],
    'release_year':['2019','2016','2013'],
    'rating':['TV-PG','TV-MA','TV-Y7-FV'],
    'duration':['90 min','94 min','1 Season'],
    'listed_in':['Children & Family Movies, Comedies',
    'Stand-Up Comedy','Kids TV'],
    'description':['Before planning an awesome wedding for his',
    'Jandino Asporaat riffs on the challenges of ra',
    'With the help of three human allies, the Autob']})
    

    Groupby country:

    df2 = df.groupby(by="country", as_index=False)['show_id']\
        .agg('count')
    

    Rename agg column:

    df2 = df2.rename(columns={'show_id':'count'})
    

    Create percentage column:

    df2['percent'] = (df2['count']*100)/df2['count'].sum()
    

    Sort descending:

    df2 = df2.sort_values(by='percent', ascending=False)
    

    Part of the issue in your Attempt #1 may have been that you didn't include the "by" parameter in your groupby function.