
Group by on entire dataframe vs group by on subset of columns of dataframe

I'm working in a codebase where I see a lot of groupby usage like this that operates on a subset of the columns of df


where cols includes some_column and column2extract, and in most cases cols = [some_column, column2extract].

Functionally, I think this is equivalent to


Is there some advantage to the former that I should be aware of? I see this often throughout this codebase, and I feel I may be missing something.

Actually, I think the two are only equivalent when cols = [some_column, column2extract], and not necessarily equivalent when cols contains additional columns.
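The snippets from the question did not survive here, so the following is only a sketch of the two patterns being compared, reconstructed from the expression quoted in the answer (df[cols].groupby(some_column).nunique()[column2extract]). The sample data and the extra column named other are hypothetical. It checks the two-column case the question describes as the common one:

```python
import pandas as pd

# Hypothetical sample data; "other" stands in for the rest of the frame.
df = pd.DataFrame({
    "some_column": ["a", "a", "b", "b", "b"],
    "column2extract": [1, 1, 2, 3, 3],
    "other": [10, 20, 30, 40, 50],
})
some_column = "some_column"
column2extract = "column2extract"
cols = [some_column, column2extract]

# Pattern 1 (assumed form): subset the frame, aggregate, then pick the column.
subset_first = df[cols].groupby(some_column).nunique()[column2extract]

# Pattern 2 (assumed form): group the whole frame, pick the column, aggregate.
select_first = df.groupby(some_column)[column2extract].nunique()

# On this data both patterns return the same Series.
same = subset_first.equals(select_first)
print(same)
```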


  • First, (...).groupby(some_column).nunique()[column2extract] wastes resources: it computes nunique for every column and only then selects the one of interest.

    This should be:


    So, df[cols].groupby(some_column).nunique()[column2extract] might be better if cols is [column2extract, some_column], but it is still unnecessarily complicated syntax.

    The only case where df[cols].groupby(some_column).(...) has an advantage is when some_column is an external Series rather than the name of a column in df.

    Thus, in my opinion, the best would be:


    If you want a Series as output, this is also an option:
