Populate Pandas DataFrame using Backfill

I make predictions on 5 dates using increments as increments = [0, 3, 6, 10, 15, 21]. So for any start prediction is made on day + 0, day + 3 and so on. Suppose that the predicted df can be generated using code below.

timezone = pytz.timezone('US/Pacific')

start_date_pst = datetime.now(timezone).date() +  timedelta(days=4)

increments = [0, 3, 6, 10, 15, 21] 

start_dates_list = []
for i in increments:
    next_date = start_date_pst + pd.Timedelta(days=i)
    start_dates_list.append(next_date)

testing_df = pd.DataFrame(start_dates_list*2, columns = ['START_DATE'])
testing_df['START_DATE'] = pd.to_datetime(testing_df['START_DATE'])
testing_df['Code']  = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']
testing_df['Preds'] = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200]

So for dates that are not between day 0th and day + 3 I want to use data from day + 3 and replicate it for two days in between and change the dates. I can achieve desired output using code below

for i in range(0, len(increments)-1):
    start_date = start_date_pst + pd.Timedelta(days=increments[i]+1)
    end_date = start_date_pst + pd.Timedelta(days=increments[i+1]-1)
    data_date = start_date_pst + pd.Timedelta(days=increments[i+1])
    date_range = pd.date_range(start_date, end_date).tolist()
    for date in date_range:
        temp_df = testing_df[testing_df.START_DATE == str(data_date)]
        temp_df['START_DATE'] = date
        testing_df =  pd.concat([testing_df, temp_df])
        print(testing_df.START_DATE.unique())
testing_df = testing_df.reset_index(drop=True)
testing_df = testing_df.sort_values(by='START_DATE')

Although I get the desired results from above code but I want to vectorise the code as I would need to perform it on larger data set. What could be a more efficient approach to reach desired output?

Solution

Here's one approach:

out = (testing_df.groupby(testing_df['START_DATE']
                          .diff().dt.days.lt(0).cumsum().rename('GroupId')
                          )
       .apply(lambda x: x.set_index('START_DATE')
              .resample('D').bfill()
              )
       .reset_index(drop=False)
       )

out

    GroupId START_DATE Code  Preds
0         0 2023-07-23    A    100
1         0 2023-07-24    B    200
2         0 2023-07-25    B    200
3         0 2023-07-26    B    200
4         0 2023-07-27    C    300
5         0 2023-07-28    C    300
6         0 2023-07-29    C    300
7         0 2023-07-30    D    400
8         0 2023-07-31    D    400
9         0 2023-08-01    D    400
10        0 2023-08-02    D    400
11        0 2023-08-03    E    500
12        0 2023-08-04    E    500
13        0 2023-08-05    E    500
14        0 2023-08-06    E    500
15        0 2023-08-07    E    500
16        0 2023-08-08    F    600
17        0 2023-08-09    F    600
18        0 2023-08-10    F    600
19        0 2023-08-11    F    600
20        0 2023-08-12    F    600
21        0 2023-08-13    F    600
22        1 2023-07-23    G    700
23        1 2023-07-24    H    800
24        1 2023-07-25    H    800
25        1 2023-07-26    H    800
26        1 2023-07-27    I    900
27        1 2023-07-28    I    900
28        1 2023-07-29    I    900
29        1 2023-07-30    J   1000
30        1 2023-07-31    J   1000
31        1 2023-08-01    J   1000
32        1 2023-08-02    J   1000
33        1 2023-08-03    K   1100
34        1 2023-08-04    K   1100
35        1 2023-08-05    K   1100
36        1 2023-08-06    K   1100
37        1 2023-08-07    K   1100
38        1 2023-08-08    L   1200
39        1 2023-08-09    L   1200
40        1 2023-08-10    L   1200
41        1 2023-08-11    L   1200
42        1 2023-08-12    L   1200
43        1 2023-08-13    L   1200

Explanation

First, identify the different groups. Given testing_df, a new group appears to start where the difference in days with the previous row in column START_DATE is at least less than 0. So, let's use Series.diff, then select the days value with Series.dt.days, evaluate less than 0 with Series.lt and finally chain Series.cumsum. Let's also chain Series.rename to avoid duplicate column names (i.e., 2x "START_DATE"). The resulting series will have values like below, with the first 1 starting the second group:

testing_df['START_DATE'].diff().dt.days.lt(0).cumsum().rename('GroupId').tolist()
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

Next, we want to use this series inside df.groupby. We'll get groups like below:

(testing_df.groupby(testing_df['START_DATE']
                    .diff().dt.days.lt(0).cumsum().rename('GroupId')
                    ).get_group(0)
 )

  START_DATE Code  Preds
0 2023-07-23    A    100
1 2023-07-26    B    200
2 2023-07-29    C    300
3 2023-08-02    D    400
4 2023-08-07    E    500
5 2023-08-13    F    600

On each of these groups we want to use .apply with a lambda function. Inside the lambda fuction, we first need to make "START_DATE" our index (using df.set_index) and then apply df.resample with an offset of "D" (i.e., "days") and chain .bfill.
At this stage, our result will have a MultiIndex, which we can reset, if necessary, with df.reset_index.