Following up with this question, now I would like to calculate the sum/mean of a different column given the same grouping on a rolling window.
Here is the code snippet to set up. I would like to calculate the sum/mean of earnings of each person per row for their past 30 days. Also, since my data is quite big (the real data has 1.7 mil rows), any approach with apply
will not work, I presume.
Pseudo:
Date
Name
sum
of Earning
and return it to the rowd = {'Name': ['Jack', 'Jim', 'Jack', 'Jim', 'Jack', 'Jack', 'Jim', 'Jack', 'Jane', 'Jane'],
'Date': ['08/01/2021',
'27/01/2021',
'05/02/2021',
'10/02/2021',
'17/02/2021',
'18/02/2021',
'20/02/2021',
'21/02/2021',
'22/02/2021',
'29/03/2021'],
'Earning': [40, 10, 20, 20, 40, 50, 100, 70, 80, 90]}
df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df.Date, format='%d/%m/%Y')
df = df.sort_values('Date')
This answer is based on @jezrael post so thank you very much for it. The difference is that this solution
def sum_of_last_n_days(df: pd.DataFrame, identifier: str, timestamp: str, sum_col: str, delta: int) -> pd.DataFrame:
col_name = "sum_%s" % identifier
temp_df = df.set_index(timestamp) \
.groupby(identifier, sort=False)[sum_col] \
.rolling('%sd' % delta, closed='both') \
.sum() \
.rename(col_name)
temp_df = temp_df[~temp_df.index.duplicated(keep="first")]
return df.merge(temp_df, how="left", left_on=[identifier, timestamp], right_index=True)
frequency_of_last_n_days(df, "Name", "Date", "Earning", 30)
The outcome
Name Date Earning sum
0 Jack 2021-01-08 40 40.0
1 Jim 2021-01-27 10 10.0
2 Jack 2021-02-05 20 60.0
3 Jim 2021-02-10 20 30.0
4 Jack 2021-02-17 40 60.0
5 Jack 2021-02-18 50 110.0
6 Jim 2021-02-20 100 130.0
7 Jack 2021-02-21 70 180.0
8 Jane 2021-02-22 80 80.0
9 Jane 2021-03-29 90 90.0