Search code examples
pythonpandasextrapolation

List of dates between two date variables in pandas dataframe


I have a dataset like this

   Business_dt   Value_dt    ID
0  05/01/2021    01/01/2021  1
1  06/01/2021    01/01/2021  1
2  07/01/2021    01/01/2021  1
3  08/01/2021    01/01/2021  1
4  15/02/2021    13/02/2021  2
5  16/02/2021    13/02/2021  2
6  17/02/2021    13/02/2021  2
7  10/08/2021    10/08/2021  3
8  11/08/2021    10/08/2021  3

I want to extrapolate the data so that when the value date is smaller than the business date, I copy the observations and change the date so that I have an observation for each business date until business date equals value date. What I want to have is this:

   Business_dt   Value_dt    ID
0  01/01/2021    01/01/2021  1
1  02/01/2021    01/01/2021  1
2  03/01/2021    01/01/2021  1
3  04/01/2021    01/01/2021  1
4  05/01/2021    01/01/2021  1
5  06/01/2021    01/01/2021  1
6  07/01/2021    01/01/2021  1
7  08/01/2021    01/01/2021  1
8  13/02/2021    13/02/2021  2
9  14/02/2021    13/02/2021  2
10 15/02/2021    13/02/2021  2
11 16/02/2021    13/02/2021  2
12 17/02/2021    13/02/2021  2
13 10/08/2021    10/08/2021  3
14 11/08/2021    10/08/2021  3

So far I selected the observations where the business_date is smaller than the value_dt and tried the following code in Python. At the end I was thinking about appending these dates to the original dataframe. However, I do not manage to extrapolate the data. How would you do it?:

df.date_range = pd.date_range(df.valdt, df.business_date - timedelta(days = df.report_diff), freq='d')

where report_diff is the difference between both dates in days.


Solution

  • You can use a custom function to reindex the dates:

    def add_missing(d):
        MIN, MAX = pd.to_datetime(d['Business_dt'], dayfirst=True).agg(['min', 'max'])
        idx = (pd.date_range(min(MIN, pd.to_datetime(d['Value_dt'].iloc[0])),
                             MAX, freq='D')
                 .strftime('%d/%m/%Y')
              )
        return (d.set_index('Business_dt')
                 .reindex(idx, method='bfill')
                 .reset_index()
               )
    
    
    out = df.groupby('ID', group_keys=False).apply(add_missing)
    
    print(out)
    

    Output:

            index    Value_dt  ID
    0  01/01/2021  01/01/2021   1
    1  02/01/2021  01/01/2021   1
    2  03/01/2021  01/01/2021   1
    3  04/01/2021  01/01/2021   1
    4  05/01/2021  01/01/2021   1
    5  06/01/2021  01/01/2021   1
    6  07/01/2021  01/01/2021   1
    7  08/01/2021  01/01/2021   1
    0  13/02/2021  13/02/2021   2
    1  14/02/2021  13/02/2021   2
    2  15/02/2021  13/02/2021   2
    3  16/02/2021  13/02/2021   2
    4  17/02/2021  13/02/2021   2
    0  10/08/2021  10/08/2021   3
    1  11/08/2021  10/08/2021   3