I'm playing with data analysis, using imdb dataset from kaggle.
I'm grouping several features like this:
color = df_q6.groupby('color', as_index=False).agg(profit_margin_mean=('profit_margin', 'mean'), \
year_mean=('title_year', 'mean'), \
critic_mean=('num_critic_for_reviews', 'mean'), \
duration_mean=('duration', 'mean'), \
cast_fb_mean=('cast_total_facebook_likes', 'mean'), \
user_review_mean=('num_user_for_reviews', 'mean'), \
imdb_mean=('imdb_score', 'mean')).round(2)
I'm trying to define a function to make such grouping easier.
So far I've done this:
def group_by_label(dataset, label, columns=[]):
agg_list = []
for i in columns:
agg_list.append("%s=(%s, 'mean')" % (i, i))
grouped = dataset.groupby(label, as_index=False).agg(*[x for x in agg_list])
return grouped
Which returns this:
AttributeError: 'profit_margin=(profit_margin, 'mean')' is not a valid function for 'DataFrameGroupBy' object
I had assumed(wrong) that this function should at least work when columns
has just 1 item.
Could you please suggest how can I make this work?
Thanks.
You want to use the .agg()
function and it takes a dictionary with column name as key and desired aggregation function as value so your aggregation function:
color = df_q6.groupby('color', as_index=False).agg(profit_margin_mean=('profit_margin', 'mean'), \
year_mean=('title_year', 'mean'), \
critic_mean=('num_critic_for_reviews', 'mean'), \
duration_mean=('duration', 'mean'), \
cast_fb_mean=('cast_total_facebook_likes', 'mean'), \
user_review_mean=('num_user_for_reviews', 'mean'), \
imdb_mean=('imdb_score', 'mean')).round(2)
should more like:
color = df_q6.groupby('color', as_index=False).agg({'profit_margin': 'mean',
'num_critic_for_reviews':'mean',
'duration': 'mean',
'cast_total_facebook_likes': 'mean',
'num_user_for_reviews': 'mean',
'imdb_score': 'mean'}).round(2)
If using a function like in your case make sure it formats columns and their aggregation functions similarly.