python, pandas, lambda, sum

Aggregation with a lambda function for the last 30 days in Python


I'm trying to build a column named 'sales_30d_lag' containing the aggregated sales of the last 30 days (counted back from the last 'Date') per user_ID. When I run the code below I get the expected result, but when I merge it with the original dataframe on user_ID, the 'sales_30d_lag' column shows NaN values. Any thoughts on what's going wrong?

df_30d_lag = (
    df.groupby(['user_ID'])
      .apply(lambda g: g[g['Date'] >= g['Date'].max() - pd.to_timedelta(30, unit='d')]
             .agg({'sales': 'sum'}))
      .rename(columns={'sales': 'sales_30d_lag'})
)

Solution

  • It's hard to guess without a data example (and the merging code); the lambda itself looks fine - I tested it on this dataset:

    import pandas as pd
    from io import StringIO
    data = """user_ID,Date,sales
    1,2012-09-01 10:00:00,10.0
    1,2012-09-02 11:00:00,10.0
    1,2012-09-03 12:00:00,10.0
    1,2012-10-01 13:00:00,10.0
    1,2012-10-02 14:00:00,10.0
    1,2012-10-03 15:00:00,10.0
    1,2012-10-04 16:00:00,10.0
    1,2012-11-01 17:00:00,10.0
    2,2012-09-01 18:00:00,20.0
    2,2012-09-02 19:00:00,20.0
    2,2012-09-03 20:00:00,20.0
    2,2012-09-04 21:00:00,20.0
    2,2012-09-05 22:00:00,20.0
    2,2012-09-06 23:00:00,
    3,2012-09-06 23:00:00,30.0"""
    df = pd.read_csv(StringIO(data), engine="python", parse_dates=["Date"])
    

    And the code gives the right results:

    df_30d_lag = (
        df.groupby(['user_ID'])
          .apply(lambda g: g[g['Date'] >= g['Date'].max() - pd.to_timedelta(30, unit='d')]
                 .agg({'sales': 'sum'}))
          .rename(columns={'sales': 'sales_30d_lag'})
    )
    
    #          sales_30d_lag
    # user_ID
    # 1                 30.0
    # 2                100.0
    # 3                 30.0
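    Before merging, it can help to confirm what the groupby/apply actually returns. A minimal sketch with a trimmed-down version of the sample data above (names as in the question):

    ```python
    import pandas as pd
    from io import StringIO

    data = """user_ID,Date,sales
    1,2012-09-01 10:00:00,10.0
    1,2012-10-01 13:00:00,10.0
    2,2012-09-01 18:00:00,20.0"""
    df = pd.read_csv(StringIO(data), parse_dates=["Date"])

    # Per-user sum of sales within 30 days of that user's last Date
    df_30d_lag = df.groupby(['user_ID']).apply(
        lambda g: g[g['Date'] >= g['Date'].max() - pd.to_timedelta(30, unit='d')]
                  .agg({'sales': 'sum'})
    ).rename(columns={'sales': 'sales_30d_lag'})

    # user_ID lives in the index, not in the columns
    print(df_30d_lag.index.name)     # user_ID
    print(list(df_30d_lag.columns))  # ['sales_30d_lag']
    ```

    So a plain column-on-column merge won't find user_ID on the right-hand side.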
    

    Maybe the merging itself is the problem - df_30d_lag is indexed by user_ID. To merge it you have to either reset the index and merge on the user_ID column, or do something like this:

    df.merge(df_30d_lag, left_on='user_ID', right_index=True)
    
    #   user_ID Date                sales   sales_30d_lag
    #0  1       2012-09-01 10:00:00 10.0    30.0
    #1  1       2012-09-02 11:00:00 10.0    30.0
    #2  1       2012-09-03 12:00:00 10.0    30.0
    #3  1       2012-10-01 13:00:00 10.0    30.0
    #4  1       2012-10-02 14:00:00 10.0    30.0
    #5  1       2012-10-03 15:00:00 10.0    30.0
    #6  1       2012-10-04 16:00:00 10.0    30.0
    #7  1       2012-11-01 17:00:00 10.0    30.0
    #8  2       2012-09-01 18:00:00 20.0    100.0
    #9  2       2012-09-02 19:00:00 20.0    100.0
    #10 2       2012-09-03 20:00:00 20.0    100.0
    #11 2       2012-09-04 21:00:00 20.0    100.0
    #12 2       2012-09-05 22:00:00 20.0    100.0
    #13 2       2012-09-06 23:00:00 NaN     100.0
    #14 3       2012-09-06 23:00:00 30.0    30.0
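    The reset-index route mentioned above works equally well: flatten the index back into a column first, then merge column-on-column. A small sketch with a trimmed-down version of the same sample data:

    ```python
    import pandas as pd
    from io import StringIO

    data = """user_ID,Date,sales
    1,2012-09-01 10:00:00,10.0
    1,2012-10-01 13:00:00,10.0
    2,2012-09-01 18:00:00,20.0"""
    df = pd.read_csv(StringIO(data), parse_dates=["Date"])

    df_30d_lag = df.groupby(['user_ID']).apply(
        lambda g: g[g['Date'] >= g['Date'].max() - pd.to_timedelta(30, unit='d')]
                  .agg({'sales': 'sum'})
    ).rename(columns={'sales': 'sales_30d_lag'})

    # reset_index turns the user_ID index back into a regular column,
    # after which a plain merge on 'user_ID' works
    merged = df.merge(df_30d_lag.reset_index(), on='user_ID')
    print(merged[['user_ID', 'sales_30d_lag']])
    ```

    Both variants produce the same result; which one to use is mostly a matter of taste.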
    

    If that is not the case, please add a data example so we can reproduce it better.