Input Dataframe:
Last Updated Downloads Category
0 2018 10000 ART_AND_DESIGN
1 2018 500000 ART_AND_DESIGN
2 2018 5000000 ART_AND_DESIGN
3 2018 50000000 ART_AND_DESIGN
4 2018 100000 ART_AND_DESIGN
... ... ...
10838 2017 1000 MEDICAL
10839 2015 1000 BOOKS_AND_REFERENCE
10840 2018 10000000 LIFESTYLE
The problem statement is: "For the years 2016,2017,2018 what is the category of apps that have got the most and the least downloads"
To solve this I used:
df1 = df_year_d.groupby(['Last Updated','Category']).sum()
print(df1)
Downloads
Last Updated Category
2010 FAMILY 100000
2011 BOOKS_AND_REFERENCE 1000000
BUSINESS 1000
FAMILY 50000
GAME 10100000
LIBRARIES_AND_DEMO 1000000
LIFESTYLE 100000
TOOLS 5156100
2012 BUSINESS 10000
COMMUNICATION 1000
FAMILY 711210
FINANCE 100000
GAME 1050000
HEALTH_AND_FITNESS 1100000
LIBRARIES_AND_DEMO 10000000
MEDICAL 120000
PHOTOGRAPHY 500000
PRODUCTIVITY 100000
SHOPPING 100000
TOOLS 200000
2013 BOOKS_AND_REFERENCE 2000
BUSINESS 10300
COMMUNICATION 151000
EDUCATION 50000
FAMILY 50338310
FINANCE 60100
GAME 40265250
HEALTH_AND_FITNESS 10000
HOUSE_AND_HOME 100000
LIBRARIES_AND_DEMO 6000000
...
2018 BOOKS_AND_REFERENCE 1880913110
BUSINESS 975227003
COMICS 55201050
COMMUNICATION 32548874886
DATING 262259557
EDUCATION 842800000
ENTERTAINMENT 2836150000
EVENTS 15410330
FAMILY 9020112207
FINANCE 872763824
FOOD_AND_DRINK 271663081
GAME 33052192901
HEALTH_AND_FITNESS 1568697276
HOUSE_AND_HOME 161847101
LIBRARIES_AND_DEMO 16283100
LIFESTYLE 468085968
MAPS_AND_NAVIGATION 702264990
MEDICAL 50556517
NEWS_AND_MAGAZINES 7491323670
PARENTING 31140010
PERSONALIZATION 2130701875
PHOTOGRAPHY 9402062515
PRODUCTIVITY 13963101723
SHOPPING 3243802640
SOCIAL 13924137461
SPORTS 1540744703
TOOLS 10633528879
TRAVEL_AND_LOCAL 6846181981
VIDEO_PLAYERS 5928936510
WEATHER 407227020
[188 rows x 1 columns]
Now I need the Category which is Max and Min in the three respective years 2016,2017,2018 Please suggest any efficient way to solve this query in Python.
First filter by Series.isin
and boolean indexing
, so only processing rows which are necessary (reason is less rows processing is better performance).
Because you want Category
by some years, first create DataFrame
by as_index=False
in aggregation sum
and then use DataFrameGroupBy.idxmax
and
DataFrameGroupBy.idxmin
for index of minimal and maximal values per groups, so possible use DataFrame.loc
for selecting, DataFrame.stack
is used for convert rows to column:
df1 = df_year_d[df_year_d['Last Updated'].isin([2016,2017,2018])]
df1 = df_year_d.groupby(['Last Updated','Category'], as_index=False).sum()
df1 = df1.loc[df1.groupby('Last Updated')['Downloads'].agg(['idxmin','idxmax']).stack()]
df1.set_index("Last Updated", inplace=True)
df1=df1.loc[['2016','2017','2018']]
print(df1)
Last Updated Category Downloads
7 2016 BUSINESS 10
6 2016 BOOKS_AND_REFERENCE 10000
14 2017 PRODUCTIVITY 10
9 2017 ART_AND_DESIGN 2660000
16 2018 AUTO_AND_VEHICLES 100
15 2018 ART_AND_DESIGN 84345000
Another idea is sortig by DataFrame.sort_values
and using GroupBy.nth
for first and last rows per groups:
df1 = df_year_d[df_year_d['Last Updated'].isin([2016,2017,2018])]
df1 = df_year_d.groupby(['Last Updated','Category'], as_index=False).sum()
df1 = (df1.sort_values(['Last Updated','Downloads'])
.groupby('Last Updated', as_index=False)
.nth([0,-1]))
print(df1)
Last Updated Category Downloads
7 2016 BUSINESS 10
8 2016 FINANCE 10000
14 2017 PRODUCTIVITY 10
9 2017 ART_AND_DESIGN 2660000
16 2018 AUTO_AND_VEHICLES 100
15 2018 ART_AND_DESIGN 84345000