Search code examples
pythonpandasfunctionfinance

unique value counts of products purchased by the customer in previous year


I am building a function that creates an aggregate table with certain variables developed from df. I think I got most things right, but I am struggling understanding how to count a number of distinct products bought by the customer in previous year.

Example of my df:

df = pd.DataFrame({'year': [2020, 2021, 2021, 2021, 2022],
                   'id': [3, 1, 1, 2, 1],
                   'price': [12, 20, 30, 40, 50],
                   'age': [20, 30, 30, 44, 31],
                   'product': [book, toy, book, tv, book],
                   'quantity': [2,1,2,5,9})

here is my code so far:

df['revenue'] = df.price*df.quantity
df['year'] = pd.DatetimeIndex(df['date']).year 

def table(df):
   return df.groupby(['year','id'])\
    .agg(revenue = ('revenue', 'sum'), age = ('age', 'unique'), product_year_before = ('product', 'nunique'))

It counts products correctly, but it needs to be for previous year and not for current year.


Solution

  • For previous year you can subtract one year in MultiIndex:

    df1 = table(df)
    
    s = df1['product_year_before'].rename(lambda x: x - 1, level=0)
    print (s)
    year  id
    2019  3     1
    2020  1     2
          2     1
    2021  1     1
    Name: product_year_before, dtype: int64
    

    All together - if need new column use DataFrame.join with s Series:

    def table(df):
       df1 = (df.groupby(['year','id'])
                .agg(revenue = ('revenue', 'sum'), 
                     age = ('age', 'unique'), 
                     product_year_before = ('product', 'nunique')))
       s = df1['product_year_before'].rename(lambda x: x - 1, level=0)
       return df1.drop('product_year_before', axis=1).join(s)
                           
    df1 = table(df)
    print (df1)
             revenue   age  product_year_before
    year id                                    
    2020 3        24  [20]                  NaN
    2021 1        80  [30]                  1.0
         2       200  [44]                  NaN
    2022 1       450  [31]                  NaN
    

    With original column:

    def table(df):
       df1 = (df.groupby(['year','id'])
                .agg(revenue = ('revenue', 'sum'), 
                     age = ('age', 'unique'), 
                     product_year= ('product', 'nunique')))
       s = df1['product_year'].rename(lambda x: x - 1, level=0)
       return df1.join(s.rename('product_year_before'))
                           
    df1 = table(df)
    print (df1)
             revenue   age  product_year  product_year_before
    year id                                                  
    2020 3        24  [20]             1                  NaN
    2021 1        80  [30]             2                  1.0
         2       200  [44]             1                  NaN
    2022 1       450  [31]             1                  NaN