Search code examples
pythonpandaspandas-groupby

Python function used with pandas groupby&aggregate


I'm playing with data analysis, using imdb dataset from kaggle.

I'm grouping several features like this:

color = df_q6.groupby('color', as_index=False).agg(profit_margin_mean=('profit_margin', 'mean'), \
                                               year_mean=('title_year', 'mean'), \
                                              critic_mean=('num_critic_for_reviews', 'mean'), \
                                              duration_mean=('duration', 'mean'), \
                                              cast_fb_mean=('cast_total_facebook_likes', 'mean'), \
                                              user_review_mean=('num_user_for_reviews', 'mean'), \
                                              imdb_mean=('imdb_score', 'mean')).round(2)

I'm trying to define a function to make such grouping easier.

So far I've done this:

def group_by_label(dataset, label, columns=[]):
   agg_list = []
   for i in columns:
       agg_list.append("%s=(%s, 'mean')" % (i, i))
   grouped = dataset.groupby(label, as_index=False).agg(*[x for x in agg_list])
   return grouped

Which returns this:

AttributeError: 'profit_margin=(profit_margin, 'mean')' is not a valid function for 'DataFrameGroupBy' object

I had assumed(wrong) that this function should at least work when columns has just 1 item.

Could you please suggest how can I make this work?

Thanks.


Solution

  • You want to use the .agg() function and it takes a dictionary with column name as key and desired aggregation function as value so your aggregation function:

    color = df_q6.groupby('color', as_index=False).agg(profit_margin_mean=('profit_margin', 'mean'), \
                                                   year_mean=('title_year', 'mean'), \
                                                  critic_mean=('num_critic_for_reviews', 'mean'), \
                                                  duration_mean=('duration', 'mean'), \
                                                  cast_fb_mean=('cast_total_facebook_likes', 'mean'), \
                                                  user_review_mean=('num_user_for_reviews', 'mean'), \
                                                  imdb_mean=('imdb_score', 'mean')).round(2)
    

    should more like:

    color = df_q6.groupby('color', as_index=False).agg({'profit_margin': 'mean',
                                                'num_critic_for_reviews':'mean',
                                                'duration': 'mean',
                                                'cast_total_facebook_likes': 'mean',
                                                'num_user_for_reviews': 'mean',
                                                'imdb_score': 'mean'}).round(2)
    

    If using a function like in your case make sure it formats columns and their aggregation functions similarly.