I'm working in a codebase where I see a lot of groupby usage like this, operating on a subset of the columns of df:
df[cols].groupby(some_column).nunique()[column2extract]
where cols includes both some_column and column2extract, and in most cases cols = [some_column, column2extract].
Functionally, I think this is equivalent to
df.groupby(some_column).nunique()[column2extract]
Is there some advantage to the former that I should be aware of? I see this often throughout this codebase, and I feel I may be missing something.
Actually, I think the two are only equivalent when cols = [some_column, column2extract], and not necessarily equivalent when cols contains additional columns.
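For reference, here is a minimal, self-contained sketch of the pattern with made-up column names ('user' standing in for some_column, 'item' for column2extract); in the simple case cols = [some_column, column2extract] the two spellings agree:

import pandas as pd

# Toy data; 'user' plays the role of some_column, 'item' of column2extract.
df = pd.DataFrame({
    "user": ["a", "a", "b", "b", "b"],
    "item": [1, 2, 2, 2, 3],
    "other": [10, 10, 20, 30, 30],
})

cols = ["user", "item"]  # i.e. cols = [some_column, column2extract]

subset_version = df[cols].groupby("user").nunique()["item"]
full_version = df.groupby("user").nunique()["item"]

print(subset_version.equals(full_version))  # True on this toy frame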
First thing, (...).groupby(some_column).nunique()[column2extract] seems like a waste of resources. You would compute the nunique for all columns, then index only the column of interest.
This should be:
(...).groupby(some_column)[column2extract].nunique()
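To make the difference concrete, here is a small sketch on a made-up DataFrame (placeholder names again): selecting the column before aggregating avoids computing nunique on columns that get thrown away anyway:

import pandas as pd

# Made-up data with one column we do not care about.
df = pd.DataFrame({
    "user": ["a", "a", "b", "b"],
    "item": [1, 2, 2, 3],
    "junk": ["x", "y", "y", "z"],
})

# Aggregate everything, then discard most of it:
wasteful = df.groupby("user").nunique()["item"]

# Select the column first, then aggregate only what is needed:
lean = df.groupby("user")["item"].nunique()

print(wasteful.equals(lean))  # True -- same values, less work in the second form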
So, df[cols].groupby(some_column).nunique()[column2extract] might be better than the full-DataFrame version if cols is [column2extract, some_column], but it is still unnecessarily complicated syntax.
The only advantage of df[cols].groupby(some_column).(...) is if some_column is an external Series and not the name of a column in df.
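As a sketch of that case (the external Series here is invented for illustration): when the grouping key is a Series aligned on df's index rather than one of df's columns, the df[cols] subset controls which columns get aggregated:

import pandas as pd

df = pd.DataFrame({"item": [1, 2, 2, 3], "other": [5, 5, 6, 6]})

# A grouping key living outside df, e.g. derived from another table (illustrative).
external_key = pd.Series(["a", "a", "b", "b"], index=df.index)

# Without the [["item"]] subset, nunique() would be computed for every column of df.
result = df[["item"]].groupby(external_key).nunique()["item"]
print(result)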
Thus, in my opinion, the best would be:
df.groupby(some_column)[column2extract].nunique()
If you want a Series as output, this is also an option:
df[column2extract].groupby(df[some_column]).nunique()
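A final side-by-side sketch on toy data (made-up names again) showing that both spellings return the same Series:

import pandas as pd

df = pd.DataFrame({"user": ["a", "a", "b"], "item": [1, 2, 2]})

# Recommended: select the column on the groupby, then aggregate.
s1 = df.groupby("user")["item"].nunique()

# Alternative: group the column's Series by the key Series directly.
s2 = df["item"].groupby(df["user"]).nunique()

print(s1.equals(s2))  # True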