I have the following DataFrame:
match_date player_id player__match_count__won
1 2012-09-25 10:00:00 23640 1
18 2012-09-25 10:00:00 17969 0
2 2012-09-26 16:00:00 17268 1
19 2012-09-26 16:00:00 11247 0
0 2012-09-25 00:00:00 23640 1
17 2012-09-25 00:00:00 17268 0
3 2012-09-29 00:00:00 12202 1
20 2012-09-29 00:00:00 23640 0
5 2012-10-02 14:20:00 9241 1
22 2012-10-02 14:20:00 23640 0
6 2012-10-03 11:20:00 17268 1
23 2012-10-03 11:20:00 23129 0
4 2012-10-02 00:00:00 17268 1
21 2012-10-02 00:00:00 8515 0
7 2012-10-05 12:20:00 17268 1
24 2012-10-05 12:20:00 9241 0
8 2012-10-06 12:00:00 17268 1
25 2012-10-06 12:00:00 940 0
9 2012-10-07 11:00:00 6764 1
26 2012-10-07 11:00:00 17268 0
10 2012-10-15 00:00:00 23640 1
27 2012-10-15 00:00:00 26418 0
11 2012-10-16 00:00:00 23640 1
28 2012-10-16 00:00:00 6764 0
12 2012-10-19 14:00:00 14527 1
29 2012-10-19 14:00:00 23640 0
13 2012-10-20 13:20:00 469 1
30 2012-10-20 13:20:00 17268 0
14 2012-10-26 00:00:00 23640 1
31 2012-10-26 00:00:00 11247 0
15 2012-10-27 00:00:00 7507 1
32 2012-10-27 00:00:00 23640 0
16 2012-10-29 17:15:00 72 1
33 2012-10-29 17:15:00 17268 0
I would like to get the sum of player__match_count__won
for a rolling 30 day period grouped by player ID excluding the current row in the rolling window. I'm using the below to return a DataFrame because in the real data I'll be performing the sum on multiple columns:
df.groupby("player_id").rolling("30d", on="match_date", closed="left").sum().sort_values("match_date")
However, I get back the following:
player_id match_date player__match_count__won
0 23640 2012-09-25 00:00:00 5.0 <- incorrect
17 17268 2012-09-25 00:00:00 5.0 <- incorrect
1 23640 2012-09-25 10:00:00 NaN
18 17969 2012-09-25 10:00:00 NaN
2 17268 2012-09-26 16:00:00 NaN
19 11247 2012-09-26 16:00:00 NaN
20 23640 2012-09-29 00:00:00 2.0
3 12202 2012-09-29 00:00:00 NaN
4 17268 2012-10-02 00:00:00 1.0
21 8515 2012-10-02 00:00:00 NaN
22 23640 2012-10-02 14:20:00 2.0
5 9241 2012-10-02 14:20:00 NaN
6 17268 2012-10-03 11:20:00 1.0
23 23129 2012-10-03 11:20:00 NaN
24 9241 2012-10-05 12:20:00 1.0
7 17268 2012-10-05 12:20:00 3.0
8 17268 2012-10-06 12:00:00 4.0
25 940 2012-10-06 12:00:00 NaN
9 6764 2012-10-07 11:00:00 NaN
26 17268 2012-10-07 11:00:00 5.0
10 23640 2012-10-15 00:00:00 2.0
27 26418 2012-10-15 00:00:00 NaN
28 6764 2012-10-16 00:00:00 1.0
11 23640 2012-10-16 00:00:00 3.0
12 14527 2012-10-19 14:00:00 NaN
29 23640 2012-10-19 14:00:00 4.0
30 17268 2012-10-20 13:20:00 5.0
13 469 2012-10-20 13:20:00 NaN
31 11247 2012-10-26 00:00:00 0.0
14 23640 2012-10-26 00:00:00 2.0
32 23640 2012-10-27 00:00:00 3.0
15 7507 2012-10-27 00:00:00 NaN
33 17268 2012-10-29 17:15:00 4.0
16 72 2012-10-29 17:15:00 NaN
If I add in a sort on match_date
upfront:
df.sort_values("match_date").groupby("player_id").rolling("30d", on="match_date", closed="left").sum().sort_values("match_date")
Then I get back the correct values:
player_id match_date player__match_count__won
17 17268 2012-09-25 00:00:00 NaN
0 23640 2012-09-25 00:00:00 NaN
1 23640 2012-09-25 10:00:00 1.0
18 17969 2012-09-25 10:00:00 NaN
19 11247 2012-09-26 16:00:00 NaN
2 17268 2012-09-26 16:00:00 0.0
20 23640 2012-09-29 00:00:00 2.0
3 12202 2012-09-29 00:00:00 NaN
21 8515 2012-10-02 00:00:00 NaN
4 17268 2012-10-02 00:00:00 1.0
22 23640 2012-10-02 14:20:00 2.0
5 9241 2012-10-02 14:20:00 NaN
23 23129 2012-10-03 11:20:00 NaN
6 17268 2012-10-03 11:20:00 2.0
24 9241 2012-10-05 12:20:00 1.0
7 17268 2012-10-05 12:20:00 3.0
8 17268 2012-10-06 12:00:00 4.0
25 940 2012-10-06 12:00:00 NaN
9 6764 2012-10-07 11:00:00 NaN
26 17268 2012-10-07 11:00:00 5.0
10 23640 2012-10-15 00:00:00 2.0
27 26418 2012-10-15 00:00:00 NaN
28 6764 2012-10-16 00:00:00 1.0
11 23640 2012-10-16 00:00:00 3.0
12 14527 2012-10-19 14:00:00 NaN
29 23640 2012-10-19 14:00:00 4.0
30 17268 2012-10-20 13:20:00 5.0
13 469 2012-10-20 13:20:00 NaN
31 11247 2012-10-26 00:00:00 0.0
14 23640 2012-10-26 00:00:00 2.0
32 23640 2012-10-27 00:00:00 3.0
15 7507 2012-10-27 00:00:00 NaN
33 17268 2012-10-29 17:15:00 4.0
16 72 2012-10-29 17:15:00 NaN
However, the DataFrame was already sorted by match_date
in the original example.
Why would this be happening?
Here's the Dataframe in dictionary form for easy replication:
from pandas import Timestamp
df = {
'match_date': {1: Timestamp('2012-09-25 10:00:00'), 18: Timestamp('2012-09-25 10:00:00'), 2: Timestamp('2012-09-26 16:00:00'), 19: Timestamp('2012-09-26 16:00:00'), 0: Timestamp('2012-09-25 00:00:00'), 17: Timestamp('2012-09-25 00:00:00'), 3: Timestamp('2012-09-29 00:00:00'), 20: Timestamp('2012-09-29 00:00:00'), 5: Timestamp('2012-10-02 14:20:00'), 22: Timestamp('2012-10-02 14:20:00'), 6: Timestamp('2012-10-03 11:20:00'), 23: Timestamp('2012-10-03 11:20:00'), 4: Timestamp('2012-10-02 00:00:00'), 21: Timestamp('2012-10-02 00:00:00'), 7: Timestamp('2012-10-05 12:20:00'), 24: Timestamp('2012-10-05 12:20:00'), 8: Timestamp('2012-10-06 12:00:00'), 25: Timestamp('2012-10-06 12:00:00'), 9: Timestamp('2012-10-07 11:00:00'), 26: Timestamp('2012-10-07 11:00:00'), 10: Timestamp('2012-10-15 00:00:00'), 27: Timestamp('2012-10-15 00:00:00'), 11: Timestamp('2012-10-16 00:00:00'), 28: Timestamp('2012-10-16 00:00:00'), 12: Timestamp('2012-10-19 14:00:00'), 29: Timestamp('2012-10-19 14:00:00'), 13: Timestamp('2012-10-20 13:20:00'), 30: Timestamp('2012-10-20 13:20:00'), 14: Timestamp('2012-10-26 00:00:00'), 31: Timestamp('2012-10-26 00:00:00'), 15: Timestamp('2012-10-27 00:00:00'), 32: Timestamp('2012-10-27 00:00:00'), 16: Timestamp('2012-10-29 17:15:00'), 33: Timestamp('2012-10-29 17:15:00')},
'player_id': {1: 23640, 18: 17969, 2: 17268, 19: 11247, 0: 23640, 17: 17268, 3: 12202, 20: 23640, 5: 9241, 22: 23640, 6: 17268, 23: 23129, 4: 17268, 21: 8515, 7: 17268, 24: 9241, 8: 17268, 25: 940, 9: 6764, 26: 17268, 10: 23640, 27: 26418, 11: 23640, 28: 6764, 12: 14527, 29: 23640, 13: 469, 30: 17268, 14: 23640, 31: 11247, 15: 7507, 32: 23640, 16: 72, 33: 17268},
'player__match_count__won': {1: 1, 18: 0, 2: 1, 19: 0, 0: 1, 17: 0, 3: 1, 20: 0, 5: 1, 22: 0, 6: 1, 23: 0, 4: 1, 21: 0, 7: 1, 24: 0, 8: 1, 25: 0, 9: 1, 26: 0, 10: 1, 27: 0, 11: 1, 28: 0, 12: 1, 29: 0, 13: 1, 30: 0, 14: 1, 31: 0, 15: 1, 32: 0, 16: 1, 33: 0},
}
df = pd.DataFrame(df)
You said
However, the DataFrame was already sorted by
match_date
in the original example
But after examining your data you provided at the end of your post, I found that are some 2012-09-25
dates coming after 2012-09-26
dates, and some 2012-10-02
dates coming after some 2012-10-03
dates.
Your data wasn't sorted perfectly originally. So df.sort_values("match_date")
makes it different, which is why you get different results.