I'm working board game data from BoardGameGeek, and I'd like to create a dataframe that groups board games by the minimum number of players AND by categories.
Here are the column names: ['name', 'category', 'playtime', 'playtime_num', 'avg_rating', 'num_ratings', 'min_players'].
I first created a new column called 'support_solo' based on 'min_players' that indicates whether or not the board game has support solo play or doesn't : 'support solo', 'not support solo'.
Then I created a groupby object:
grouped = popular.groupby(['support_solo', 'category'])
After that I called basic aggregate functions to get a breakdown of the number of games in each category, and for in each 'solo/not solo group', as well as average values for other fields such as playtime. However, I'm having trouble getting the game with the most number of ratings for each category. I used a helper function and a dictionary of all of the groupby aggregations:
def game_with_highest_ratings(group):
max_ratings_index = group['num_ratings'].idxmax()
return group.loc[max_ratings_index, 'name']
aggregations = {
'name': 'count', # total number of games in each category
'num_ratings': game_with_highest_ratings, # game with the most ratings in each category
'avg_rating': 'mean', # average rating of games in each category
'playtime_num': 'mean', # average playtime of games in each category
}
grouped_result = grouped.agg(aggregations)
I keep on getting KeyError: 'num_ratings', and I don't know how to fix this issue. I have already checked for correct column names. How can I fix this issue, or is there an alternative method?
agg
passes only a Series to the function, not the full DataFrame. Trying to slice here would search in the indices (group
is already 1D). You cannot access the other columns.
The most efficient workaround is likely to aggregate the idxmax
and then to post-process it.
aggregations = {
'name': 'count', # total number of games in each category
'num_ratings': 'idxmax', # game with the most ratings in each category
'avg_rating': 'mean', # average rating of games in each category
'playtime_num': 'mean', # average playtime of games in each category
}
grouped_result = grouped.agg(aggregations)
grouped_result['num_ratings'] = grouped_result['num_ratings'].map(popular['name'])
A not-as-clean solution could be to use the hardcoded external reference in the function:
def game_with_highest_ratings(group):
return popular.at[group.idxmax(), 'name']