I make predictions on 6 dates using increments = [0, 3, 6, 10, 15, 21]. So for any start date, a prediction is made on day + 0, day + 3 and so on. Suppose that the predicted df can be generated using the code below.
import pandas as pd
import pytz
from datetime import datetime, timedelta

timezone = pytz.timezone('US/Pacific')
start_date_pst = datetime.now(timezone).date() + timedelta(days=4)
increments = [0, 3, 6, 10, 15, 21]
start_dates_list = []
for i in increments:
next_date = start_date_pst + pd.Timedelta(days=i)
start_dates_list.append(next_date)
testing_df = pd.DataFrame(start_dates_list*2, columns = ['START_DATE'])
testing_df['START_DATE'] = pd.to_datetime(testing_df['START_DATE'])
testing_df['Code'] = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']
testing_df['Preds'] = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200]
For the dates that fall between two prediction days (e.g., day + 1 and day + 2, which lie between day + 0 and day + 3), I want to use the data from the later prediction day (day + 3), replicate it for the days in between, and change the dates accordingly. I can achieve the desired output using the code below.
for i in range(len(increments) - 1):
    start_date = start_date_pst + pd.Timedelta(days=increments[i] + 1)
    end_date = start_date_pst + pd.Timedelta(days=increments[i + 1] - 1)
    data_date = start_date_pst + pd.Timedelta(days=increments[i + 1])
    date_range = pd.date_range(start_date, end_date).tolist()
    for date in date_range:
        # copy the rows for the next prediction day and re-date them
        temp_df = testing_df[testing_df.START_DATE == str(data_date)].copy()
        temp_df['START_DATE'] = date
        testing_df = pd.concat([testing_df, temp_df])

print(testing_df.START_DATE.unique())
testing_df = testing_df.sort_values(by='START_DATE').reset_index(drop=True)
Although the code above gives the desired results, I want to vectorise it, as I will need to run it on a larger data set. What would be a more efficient approach to reach the desired output?
Here's one approach:
out = (testing_df.groupby(testing_df['START_DATE']
.diff().dt.days.lt(0).cumsum().rename('GroupId')
)
.apply(lambda x: x.set_index('START_DATE')
.resample('D').bfill()
)
.reset_index(drop=False)
)
out
GroupId START_DATE Code Preds
0 0 2023-07-23 A 100
1 0 2023-07-24 B 200
2 0 2023-07-25 B 200
3 0 2023-07-26 B 200
4 0 2023-07-27 C 300
5 0 2023-07-28 C 300
6 0 2023-07-29 C 300
7 0 2023-07-30 D 400
8 0 2023-07-31 D 400
9 0 2023-08-01 D 400
10 0 2023-08-02 D 400
11 0 2023-08-03 E 500
12 0 2023-08-04 E 500
13 0 2023-08-05 E 500
14 0 2023-08-06 E 500
15 0 2023-08-07 E 500
16 0 2023-08-08 F 600
17 0 2023-08-09 F 600
18 0 2023-08-10 F 600
19 0 2023-08-11 F 600
20 0 2023-08-12 F 600
21 0 2023-08-13 F 600
22 1 2023-07-23 G 700
23 1 2023-07-24 H 800
24 1 2023-07-25 H 800
25 1 2023-07-26 H 800
26 1 2023-07-27 I 900
27 1 2023-07-28 I 900
28 1 2023-07-29 I 900
29 1 2023-07-30 J 1000
30 1 2023-07-31 J 1000
31 1 2023-08-01 J 1000
32 1 2023-08-02 J 1000
33 1 2023-08-03 K 1100
34 1 2023-08-04 K 1100
35 1 2023-08-05 K 1100
36 1 2023-08-06 K 1100
37 1 2023-08-07 K 1100
38 1 2023-08-08 L 1200
39 1 2023-08-09 L 1200
40 1 2023-08-10 L 1200
41 1 2023-08-11 L 1200
42 1 2023-08-12 L 1200
43 1 2023-08-13 L 1200
Explanation

In testing_df, a new group starts where the difference in days with the previous row in column START_DATE is negative (the dates jump back to the first start date). So, let's use Series.diff, then select the days value with Series.dt.days, evaluate less than 0 with Series.lt, and finally chain Series.cumsum. Let's also chain Series.rename to avoid duplicate column names (i.e., 2x "START_DATE"). The resulting series will have values like below, with the first 1 starting the second group:

testing_df['START_DATE'].diff().dt.days.lt(0).cumsum().rename('GroupId').tolist()
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
Next, pass this series to df.groupby. We'll get groups like below:

(testing_df.groupby(testing_df['START_DATE']
.diff().dt.days.lt(0).cumsum().rename('GroupId')
).get_group(0)
)
START_DATE Code Preds
0 2023-07-23 A 100
1 2023-07-26 B 200
2 2023-07-29 C 300
3 2023-08-02 D 400
4 2023-08-07 E 500
5 2023-08-13 F 600
Next, use .apply with a lambda function. Inside the lambda function, we first need to make "START_DATE" our index (using df.set_index) and then apply df.resample with an offset of "D" (i.e., "days") and chain .bfill, so each newly created daily row takes the values of the next known date. Finally, restore a default index with df.reset_index.
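To see the per-group step in isolation, here is a minimal sketch of set_index + resample + bfill on a tiny frame with the same shape as the first two rows of group 0 (the dates and values are taken from the example above):

```python
import pandas as pd

# Two prediction rows three days apart, as in group 0 of the example.
group = pd.DataFrame({
    'START_DATE': pd.to_datetime(['2023-07-23', '2023-07-26']),
    'Code': ['A', 'B'],
    'Preds': [100, 200],
})

# resample('D') inserts the missing daily dates (07-24, 07-25);
# bfill fills each inserted row with the values of the next known date.
filled = group.set_index('START_DATE').resample('D').bfill()
print(filled)
#            Code  Preds
# START_DATE
# 2023-07-23    A    100
# 2023-07-24    B    200
# 2023-07-25    B    200
# 2023-07-26    B    200
```

This is exactly what the lambda does to each group, after which reset_index brings GroupId and START_DATE back as regular columns.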