I have data where for multiple years, observations i are categorized in cat. An observation i can be in multiple categories in any year, but is unique across years. I am trying to count unique values for i by year, by cat, and by year and cat.
I'm learning Python (v3.12) & Pandas (v2.2.1). I can make this work, but only by creating separate tables for the counts, and merging them back in with the main data. See the example below. I suspect there is a better way to do this. Is there, and, if so, how?
import pandas as pd
df = pd.DataFrame(
{'year': [2020,2020,2020,2021,2021,2022,2023,2023,2023,2023],
'cat': [1,1,2,2,3,3,1,2,3,4],
'i': ['a','a','b','c','d','e','f','f','g','g']
})
df
df_cat = df.groupby('cat')['i'].nunique()
df_year = df.groupby('year')['i'].nunique()
df_catyear = df.groupby(['cat', 'year'])['i'].nunique()
df_merged = df.merge(df_cat, how='left', on='cat').rename(columns={'i_x': 'i', 'i_y': 'n_by_cat'})
df_merged = df_merged.merge(df_year, how='left', on='year').rename(columns={'i_x': 'i', 'i_y': 'n_by_year'})
df_merged = df_merged.merge(df_catyear, how='left', on=['cat', 'year']).rename(columns={'i_x': 'i', 'i_y': 'n_by_catyear'})
You could use a simple loop and groupby.transform
:
groups = ['cat', 'year', ['cat', 'year']]
for g in groups:
df[f"n_by_{''.join(g)}"] = df.groupby(g)['i'].transform('nunique')
Output:
year cat i n_by_cat n_by_year n_by_catyear
0 2020 1 a 2 2 1
1 2020 1 a 2 2 1
2 2020 2 b 3 2 1
3 2021 2 c 3 2 1
4 2021 3 d 3 2 1
5 2022 3 e 3 1 1
6 2023 1 f 2 2 1
7 2023 2 f 3 2 1
8 2023 3 g 3 2 1
9 2023 4 g 1 2 1