Search code examples
pythonpandasgroup-by

Counting number of unique values in groups


I have data where for multiple years, observations i are categorized in cat. An observation i can be in multiple categories in any year, but is unique across years. I am trying to count unique values for i by year, by cat, and by year and cat.

I'm learning Python (v3.12) & Pandas (v2.2.1). I can make this work, but only by creating separate tables for the counts, and merging them back in with the main data. See the example below. I suspect there is a better way to do this. Is there, and, if so, how?

import pandas as pd
df = pd.DataFrame(
        {'year': [2020,2020,2020,2021,2021,2022,2023,2023,2023,2023],
        'cat': [1,1,2,2,3,3,1,2,3,4],
        'i': ['a','a','b','c','d','e','f','f','g','g']
        })
df

df_cat = df.groupby('cat')['i'].nunique()
df_year = df.groupby('year')['i'].nunique()
df_catyear = df.groupby(['cat', 'year'])['i'].nunique()

df_merged = df.merge(df_cat, how='left', on='cat').rename(columns={'i_x': 'i', 'i_y': 'n_by_cat'})
df_merged = df_merged.merge(df_year, how='left', on='year').rename(columns={'i_x': 'i', 'i_y': 'n_by_year'})
df_merged = df_merged.merge(df_catyear, how='left', on=['cat', 'year']).rename(columns={'i_x': 'i', 'i_y': 'n_by_catyear'})

Solution

  • You could use a simple loop and groupby.transform:

    groups = ['cat', 'year', ['cat', 'year']]
    
    for g in groups:
        df[f"n_by_{''.join(g)}"] = df.groupby(g)['i'].transform('nunique')
    

    Output:

       year  cat  i  n_by_cat  n_by_year  n_by_catyear
    0  2020    1  a         2          2             1
    1  2020    1  a         2          2             1
    2  2020    2  b         3          2             1
    3  2021    2  c         3          2             1
    4  2021    3  d         3          2             1
    5  2022    3  e         3          1             1
    6  2023    1  f         2          2             1
    7  2023    2  f         3          2             1
    8  2023    3  g         3          2             1
    9  2023    4  g         1          2             1