Search code examples
pythonpandasanalytics

How do i use the groupby function on dataframe in Python


Input Dataframe:

      Last Updated  Downloads             Category  
0             2018      10000       ART_AND_DESIGN  
1             2018     500000       ART_AND_DESIGN  
2             2018    5000000       ART_AND_DESIGN  
3             2018   50000000       ART_AND_DESIGN  
4             2018     100000       ART_AND_DESIGN  
           ...        ...                  ...  
10838         2017       1000              MEDICAL  
10839         2015       1000  BOOKS_AND_REFERENCE  
10840         2018   10000000            LIFESTYLE  

The problem statement is: "For the years 2016,2017,2018 what is the category of apps that have got the most and the least downloads"

To solve this I used:

df1 = df_year_d.groupby(['Last Updated','Category']).sum()
print(df1)
                                    Downloads
Last Updated Category                        
2010         FAMILY                    100000
2011         BOOKS_AND_REFERENCE      1000000
             BUSINESS                    1000
             FAMILY                     50000
             GAME                    10100000
             LIBRARIES_AND_DEMO       1000000
             LIFESTYLE                 100000
             TOOLS                    5156100
2012         BUSINESS                   10000
             COMMUNICATION               1000
             FAMILY                    711210
             FINANCE                   100000
             GAME                     1050000
             HEALTH_AND_FITNESS       1100000
             LIBRARIES_AND_DEMO      10000000
             MEDICAL                   120000
             PHOTOGRAPHY               500000
             PRODUCTIVITY              100000
             SHOPPING                  100000
             TOOLS                     200000
2013         BOOKS_AND_REFERENCE         2000
             BUSINESS                   10300
             COMMUNICATION             151000
             EDUCATION                  50000
             FAMILY                  50338310
             FINANCE                    60100
             GAME                    40265250
             HEALTH_AND_FITNESS         10000
             HOUSE_AND_HOME            100000
             LIBRARIES_AND_DEMO       6000000
                                      ...
2018         BOOKS_AND_REFERENCE   1880913110
             BUSINESS               975227003
             COMICS                  55201050
             COMMUNICATION        32548874886
             DATING                 262259557
             EDUCATION              842800000
             ENTERTAINMENT         2836150000
             EVENTS                  15410330
             FAMILY                9020112207
             FINANCE                872763824
             FOOD_AND_DRINK         271663081
             GAME                 33052192901
             HEALTH_AND_FITNESS    1568697276
             HOUSE_AND_HOME         161847101
             LIBRARIES_AND_DEMO      16283100
             LIFESTYLE              468085968
             MAPS_AND_NAVIGATION    702264990
             MEDICAL                 50556517
             NEWS_AND_MAGAZINES    7491323670
             PARENTING               31140010
             PERSONALIZATION       2130701875
             PHOTOGRAPHY           9402062515
             PRODUCTIVITY         13963101723
             SHOPPING              3243802640
             SOCIAL               13924137461
             SPORTS                1540744703
             TOOLS                10633528879
             TRAVEL_AND_LOCAL      6846181981
             VIDEO_PLAYERS         5928936510
             WEATHER                407227020

[188 rows x 1 columns]

Now I need the Category which is Max and Min in the three respective years 2016,2017,2018 Please suggest any efficient way to solve this query in Python.


Solution

  • First filter by Series.isin and boolean indexing, so only processing rows which are necessary (reason is less rows processing is better performance).

    Because you want Category by some years, first create DataFrame by as_index=False in aggregation sum and then use DataFrameGroupBy.idxmax and DataFrameGroupBy.idxmin for index of minimal and maximal values per groups, so possible use DataFrame.loc for selecting, DataFrame.stack is used for convert rows to column:

    df1 = df_year_d[df_year_d['Last Updated'].isin([2016,2017,2018])]
    df1 = df_year_d.groupby(['Last Updated','Category'], as_index=False).sum()
    
    
    df1 = df1.loc[df1.groupby('Last Updated')['Downloads'].agg(['idxmin','idxmax']).stack()]
    df1.set_index("Last Updated", inplace=True)
    df1=df1.loc[['2016','2017','2018']]
    
    
    print(df1)
        Last Updated             Category  Downloads
    7           2016             BUSINESS         10
    6           2016  BOOKS_AND_REFERENCE      10000
    14          2017         PRODUCTIVITY         10
    9           2017       ART_AND_DESIGN    2660000
    16          2018    AUTO_AND_VEHICLES        100
    15          2018       ART_AND_DESIGN   84345000
    

    Another idea is sortig by DataFrame.sort_values and using GroupBy.nth for first and last rows per groups:

    df1 = df_year_d[df_year_d['Last Updated'].isin([2016,2017,2018])]
    df1 = df_year_d.groupby(['Last Updated','Category'], as_index=False).sum()
    
    df1 = (df1.sort_values(['Last Updated','Downloads'])
              .groupby('Last Updated', as_index=False)
              .nth([0,-1]))
    print(df1)
        Last Updated           Category  Downloads
    7           2016           BUSINESS         10
    8           2016            FINANCE      10000
    14          2017       PRODUCTIVITY         10
    9           2017     ART_AND_DESIGN    2660000
    16          2018  AUTO_AND_VEHICLES        100
    15          2018     ART_AND_DESIGN   84345000