Tags: python, pandas, dataframe, group-by

Group by on entire dataframe vs group by on subset of columns of dataframe


I'm working in a codebase where I see a lot of groupby usage like this, operating on a subset of the columns of df:

df[cols].groupby(some_column).nunique()[column2extract]

where cols includes some_column and column2extract, and in most cases cols = [some_column, column2extract].

Functionally, I think this is equivalent to

df.groupby(some_column).nunique()[column2extract]

Is there some advantage to the former that I should be aware of? I see this often throughout this codebase, and I feel I may be missing something.

Actually, I think the two are only equivalent when cols = [some_column, column2extract], and not necessarily equivalent when cols contains additional columns.
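For reference, here is a minimal sketch that compares the two expressions on a toy frame. The data and the column names "key", "value" and "extra" are made up for illustration and are not from the codebase in question; it only checks the two-column case of cols.

```python
import pandas as pd

# Toy data; "key", "value" and "extra" are hypothetical column names.
df = pd.DataFrame({
    "key":   ["a", "a", "b", "b", "b"],
    "value": [1, 1, 2, 3, 3],
    "extra": [10, 20, 30, 30, 30],
})
some_column = "key"
column2extract = "value"
cols = [some_column, column2extract]

# Pattern seen in the codebase: subset the columns first, then group.
subset_result = df[cols].groupby(some_column).nunique()[column2extract]

# Group the whole frame, then pick the column of interest.
full_result = df.groupby(some_column).nunique()[column2extract]

# Compare the two results (same index, same values) on this toy frame.
print(subset_result.equals(full_result))
```

A check like this only says something about the data it is run on, so it is worth repeating on a real sample before relying on it.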


Solution

  • First, (...).groupby(some_column).nunique()[column2extract] looks like a waste of resources: it computes nunique for every column and only then selects the one of interest.

    This should be:

    (...).groupby(some_column)[column2extract].nunique()
    

    So, df[cols].groupby(some_column).nunique()[column2extract] is less wasteful when cols is just [column2extract, some_column], but the syntax is still unnecessarily complicated.

    The only case where df[cols].groupby(some_column).(...) has an advantage is when some_column is an external Series rather than the name of a column in df.

    Thus, in my opinion, the best would be:

    df.groupby(some_column)[column2extract].nunique()
    

    If you want a Series as output, this is also an option:

    df[column2extract].groupby(df[some_column]).nunique()
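
To make the variants above concrete, here is a small sketch on made-up data. The column names "key", "value" and "extra" and the external Series named "label" are assumptions for illustration only. It shows the select-after-aggregate form, the select-before-aggregate form, the Series-first form, and grouping by a Series that is not a column of df.

```python
import pandas as pd

# Toy data; all names here are hypothetical.
df = pd.DataFrame({
    "key":   ["a", "a", "b", "b", "b"],
    "value": [1, 1, 2, 3, 3],
    "extra": [10, 20, 30, 30, 30],
})

# Aggregates every non-key column ("value" and "extra"), then keeps one.
wasteful = df.groupby("key").nunique()["value"]

# Selects the column on the GroupBy object, so only "value" is aggregated.
preferred = df.groupby("key")["value"].nunique()

# Series-first form: group one column by another column passed as a Series.
series_first = df["value"].groupby(df["key"]).nunique()

# Grouping key that is not a column of df; it is aligned with df on the index.
external_key = pd.Series(["x", "x", "y", "y", "y"], index=df.index, name="label")
by_external = df["value"].groupby(external_key).nunique()

print(preferred)
print(by_external)
```

On a frame this small the cost difference is invisible; it should only start to matter when df has many columns that the first form aggregates for nothing.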