Search code examples
pythonpandasdataframerolling-computation

Pandas Rolling window with filtering condition to remove the some latest data


This is a follow-up question of this. I would like to perform a rolling window of the last n days but I want to filter out the latest x days from each window (x is smaller than n)

Here is an example:

d = {'Name': ['Jack', 'Jim', 'Jack', 'Jim', 'Jack', 'Jack', 'Jim', 'Jack', 'Jane', 'Jane'],
     'Date': ['08/01/2021',
              '27/01/2021',
              '05/02/2021',
              '10/02/2021',
              '17/02/2021',
              '18/02/2021',
              '20/02/2021',
              '21/02/2021',
              '22/02/2021',
              '29/03/2021'],
     'Earning': [40, 10, 20, 20, 40, 50, 100, 70, 80, 90]}

df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df.Date, format='%d/%m/%Y')
df = df.sort_values('Date')
   Name       Date  Earning
0  Jack 2021-01-08       40
1   Jim 2021-01-27       10
2  Jack 2021-02-05       20
3   Jim 2021-02-10       20
4  Jack 2021-02-17       40
5  Jack 2021-02-18       50
6   Jim 2021-02-20      100
7  Jack 2021-02-21       70
8  Jane 2021-02-22       80
9  Jane 2021-03-29       90

I would like to

  • For each row, take the last 30 days of the same Name - call it a window
  • Remove the latest 20 days of each window (i.e. only take the earliest 10 days)
  • Calculate the sum on the Earning column

Expected outcome: (The two columns Window_From and Window_To are not needed. I only use them to demonstrate the mock data)

   Name       Date  Earning Window_From  Window_To   Sum
0  Jack 2021-01-08       40  2020-12-09 2020-12-19   0.0
1   Jim 2021-01-27       10  2020-12-28 2021-01-07   0.0
2  Jack 2021-02-05       20  2021-01-06 2021-01-16  40.0
3   Jim 2021-02-10       20  2021-01-11 2021-01-21   0.0
4  Jack 2021-02-17       40  2021-01-18 2021-01-28   0.0
5  Jack 2021-02-18       50  2021-01-19 2021-01-29   0.0
6   Jim 2021-02-20      100  2021-01-21 2021-01-31  10.0
7  Jack 2021-02-21       70  2021-01-22 2021-02-01   0.0
8  Jane 2021-02-22       80  2021-01-23 2021-02-02   0.0
9  Jane 2021-03-29       90  2021-02-27 2021-03-09   0.0

Solution

  • Easy solution

    Calculate 30 days and 20 days rolling sum then subtract 30 day sum from 20 day sum to get the effective rolling sum for first 10 days

    s1 = df.groupby('Name').rolling('30d', on='Date')['Earning'].sum()
    s2 = df.groupby('Name').rolling('20d', on='Date')['Earning'].sum()
    
    df.merge(s1.sub(s2).reset_index(name='sum'), how='left')
    

       Name       Date  Earning   sum
    0  Jack 2021-01-08       40   0.0
    1   Jim 2021-01-27       10   0.0
    2  Jack 2021-02-05       20  40.0
    3   Jim 2021-02-10       20   0.0
    4  Jack 2021-02-17       40   0.0
    5  Jack 2021-02-18       50   0.0
    6   Jim 2021-02-20      100  10.0
    7  Jack 2021-02-21       70   0.0
    8  Jane 2021-02-22       80   0.0
    9  Jane 2021-03-29       90   0.0