Search code examples
pythonpandasvectorization

Populate Pandas DataFrame using Backfill


I make predictions on 5 dates using increments as increments = [0, 3, 6, 10, 15, 21]. So for any start prediction is made on day + 0, day + 3 and so on. Suppose that the predicted df can be generated using code below.

timezone = pytz.timezone('US/Pacific')

start_date_pst = datetime.now(timezone).date() +  timedelta(days=4)

increments = [0, 3, 6, 10, 15, 21] 

start_dates_list = []
for i in increments:
    next_date = start_date_pst + pd.Timedelta(days=i)
    start_dates_list.append(next_date)

testing_df = pd.DataFrame(start_dates_list*2, columns = ['START_DATE'])
testing_df['START_DATE'] = pd.to_datetime(testing_df['START_DATE'])
testing_df['Code']  = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']
testing_df['Preds'] = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200]

So for dates that are not between day 0th and day + 3 I want to use data from day + 3 and replicate it for two days in between and change the dates. I can achieve desired output using code below

for i in range(0, len(increments)-1):
    start_date = start_date_pst + pd.Timedelta(days=increments[i]+1)
    end_date = start_date_pst + pd.Timedelta(days=increments[i+1]-1)
    data_date = start_date_pst + pd.Timedelta(days=increments[i+1])
    date_range = pd.date_range(start_date, end_date).tolist()
    for date in date_range:
        temp_df = testing_df[testing_df.START_DATE == str(data_date)]
        temp_df['START_DATE'] = date
        testing_df =  pd.concat([testing_df, temp_df])
        print(testing_df.START_DATE.unique())
testing_df = testing_df.reset_index(drop=True)
testing_df = testing_df.sort_values(by='START_DATE')

Although I get the desired results from above code but I want to vectorise the code as I would need to perform it on larger data set. What could be a more efficient approach to reach desired output?


Solution

  • Here's one approach:

    out = (testing_df.groupby(testing_df['START_DATE']
                              .diff().dt.days.lt(0).cumsum().rename('GroupId')
                              )
           .apply(lambda x: x.set_index('START_DATE')
                  .resample('D').bfill()
                  )
           .reset_index(drop=False)
           )
    
    out
    
        GroupId START_DATE Code  Preds
    0         0 2023-07-23    A    100
    1         0 2023-07-24    B    200
    2         0 2023-07-25    B    200
    3         0 2023-07-26    B    200
    4         0 2023-07-27    C    300
    5         0 2023-07-28    C    300
    6         0 2023-07-29    C    300
    7         0 2023-07-30    D    400
    8         0 2023-07-31    D    400
    9         0 2023-08-01    D    400
    10        0 2023-08-02    D    400
    11        0 2023-08-03    E    500
    12        0 2023-08-04    E    500
    13        0 2023-08-05    E    500
    14        0 2023-08-06    E    500
    15        0 2023-08-07    E    500
    16        0 2023-08-08    F    600
    17        0 2023-08-09    F    600
    18        0 2023-08-10    F    600
    19        0 2023-08-11    F    600
    20        0 2023-08-12    F    600
    21        0 2023-08-13    F    600
    22        1 2023-07-23    G    700
    23        1 2023-07-24    H    800
    24        1 2023-07-25    H    800
    25        1 2023-07-26    H    800
    26        1 2023-07-27    I    900
    27        1 2023-07-28    I    900
    28        1 2023-07-29    I    900
    29        1 2023-07-30    J   1000
    30        1 2023-07-31    J   1000
    31        1 2023-08-01    J   1000
    32        1 2023-08-02    J   1000
    33        1 2023-08-03    K   1100
    34        1 2023-08-04    K   1100
    35        1 2023-08-05    K   1100
    36        1 2023-08-06    K   1100
    37        1 2023-08-07    K   1100
    38        1 2023-08-08    L   1200
    39        1 2023-08-09    L   1200
    40        1 2023-08-10    L   1200
    41        1 2023-08-11    L   1200
    42        1 2023-08-12    L   1200
    43        1 2023-08-13    L   1200
    

    Explanation

    • First, identify the different groups. Given testing_df, a new group appears to start where the difference in days with the previous row in column START_DATE is at least less than 0. So, let's use Series.diff, then select the days value with Series.dt.days, evaluate less than 0 with Series.lt and finally chain Series.cumsum. Let's also chain Series.rename to avoid duplicate column names (i.e., 2x "START_DATE"). The resulting series will have values like below, with the first 1 starting the second group:
    testing_df['START_DATE'].diff().dt.days.lt(0).cumsum().rename('GroupId').tolist()
    # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    
    • Next, we want to use this series inside df.groupby. We'll get groups like below:
    (testing_df.groupby(testing_df['START_DATE']
                        .diff().dt.days.lt(0).cumsum().rename('GroupId')
                        ).get_group(0)
     )
    
      START_DATE Code  Preds
    0 2023-07-23    A    100
    1 2023-07-26    B    200
    2 2023-07-29    C    300
    3 2023-08-02    D    400
    4 2023-08-07    E    500
    5 2023-08-13    F    600
    
    • On each of these groups we want to use .apply with a lambda function. Inside the lambda fuction, we first need to make "START_DATE" our index (using df.set_index) and then apply df.resample with an offset of "D" (i.e., "days") and chain .bfill.
    • At this stage, our result will have a MultiIndex, which we can reset, if necessary, with df.reset_index.