I have a dataset that is reporting values for a specific date, that can then be updated on subsequent dates, thus creating 2 columns, Date
and Reported_Date
, for each Reported_Value
. There is a separate ID
field that is my dataframe's index. I want to calculate the mode and max for the last 5 reported dates. I know I can use dataset['Reported_Value'].rolling(5).max()
to calculate the max, but trying rolling with mode leads to an error, 'Rolling' object has no attribute 'mode'
. Does anyone know how this could be achieved? Is there also a way to have it only compute across one Date? So that the first few values of 2021-12-02
weren't using 2021-12-01
values?
Example DataFrame:
ID Date Reported_Date Reported_Value Max_Last_5_Reported_Days
1 2021-12-01 2021-12-10 5 NaN
2 2021-12-01 2021-12-11 6 NaN
3 2021-12-01 2021-12-12 5 NaN
4 2021-12-01 2021-12-13 3 NaN
5 2021-12-01 2021-12-14 2 6
6 2021-12-01 2021-12-15 11 11
7 2021-12-01 2021-12-16 7 11
8 2021-12-01 2021-12-17 5 11
9 2021-12-01 2021-12-18 6 11
10 2021-12-01 2021-12-19 7 11
11 2021-12-02 2021-12-10 2 7
12 2021-12-02 2021-12-11 3 7
13 2021-12-02 2021-12-12 2 7
14 2021-12-02 2021-12-13 4 7
15 2021-12-02 2021-12-14 4 4
16 2021-12-02 2021-12-15 4 4
17 2021-12-02 2021-12-16 3 4
18 2021-12-02 2021-12-17 4 4
19 2021-12-02 2021-12-18 2 4
20 2021-12-02 2021-12-19 4 4
Desired DataFrame:
ID Date Reported_Date Reported_Value Max_Last_5_Report_Days Mode_L5RD
1 2021-12-01 2021-12-10 5 NaN NaN
2 2021-12-01 2021-12-11 6 NaN NaN
3 2021-12-01 2021-12-12 5 NaN NaN
4 2021-12-01 2021-12-13 3 NaN NaN
5 2021-12-01 2021-12-14 2 6 5
6 2021-12-01 2021-12-15 11 11 NaN
7 2021-12-01 2021-12-16 6 11 NaN
8 2021-12-01 2021-12-17 5 11 NaN
9 2021-12-01 2021-12-18 6 11 6
10 2021-12-01 2021-12-19 6 11 6
11 2021-12-02 2021-12-10 2 NaN NaN
12 2021-12-02 2021-12-11 3 NaN NaN
13 2021-12-02 2021-12-12 2 NaN NaN
14 2021-12-02 2021-12-13 4 NaN NaN
15 2021-12-02 2021-12-14 4 4 4
16 2021-12-02 2021-12-15 4 4 4
17 2021-12-02 2021-12-16 3 4 4
18 2021-12-02 2021-12-17 4 4 4
19 2021-12-02 2021-12-18 2 4 4
20 2021-12-02 2021-12-19 4 4 4
I'm not sure how I would convey that there are multiple mode values, so they're listed as NaN in the example.
groupby
"Date" and use rolling_max
for max for last 5 days; apply scipy.stats.mode
for mode:
from scipy.stats import mode
rolling_obj = df.groupby('Date')['Reported_Value'].rolling(5)
df['Max_Last_5_Report_Days'] = rolling_obj.max().droplevel(0)
df['Mode_L5RD'] = rolling_obj.apply(lambda x: mode(x)[0]).droplevel(0)
Output:
ID Date Reported_Date Reported_Value Max_Last_5_Reported_Days \
0 1 2021-12-01 2021-12-10 5 NaN
1 2 2021-12-01 2021-12-11 6 NaN
2 3 2021-12-01 2021-12-12 5 NaN
3 4 2021-12-01 2021-12-13 3 NaN
4 5 2021-12-01 2021-12-14 2 6.0
5 6 2021-12-01 2021-12-15 11 11.0
6 7 2021-12-01 2021-12-16 7 11.0
7 8 2021-12-01 2021-12-17 5 11.0
8 9 2021-12-01 2021-12-18 6 11.0
9 10 2021-12-01 2021-12-19 7 11.0
10 11 2021-12-02 2021-12-10 2 7.0
11 12 2021-12-02 2021-12-11 3 7.0
12 13 2021-12-02 2021-12-12 2 7.0
13 14 2021-12-02 2021-12-13 4 7.0
14 15 2021-12-02 2021-12-14 4 4.0
15 16 2021-12-02 2021-12-15 4 4.0
16 17 2021-12-02 2021-12-16 3 4.0
17 18 2021-12-02 2021-12-17 4 4.0
18 19 2021-12-02 2021-12-18 2 4.0
19 20 2021-12-02 2021-12-19 4 4.0
Max_Last_5_Report_Days Mode_L5RD
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 6.0 5.0
5 11.0 2.0
6 11.0 2.0
7 11.0 2.0
8 11.0 2.0
9 11.0 7.0
10 NaN NaN
11 NaN NaN
12 NaN NaN
13 NaN NaN
14 4.0 2.0
15 4.0 4.0
16 4.0 4.0
17 4.0 4.0
18 4.0 4.0
19 4.0 4.0