I have to expand a pandas dataframe into individual rows, based on each row's start date and end date.
The original dataframe is as below:
My final dataframe should repeat each row once for every day between that row's start and end date. The result needs to be expanded for each date, while all other columns except 'startdate' and 'enddate' are preserved.
For instance, the first row with startdate = 01-Jan-20 and enddate = 15-Jan-20 should be expanded into 15 separate rows, each representing one date in the series, as shown in the sample resultant dataframe here:
I tried using itertuples to iterate over the dataframe and break the ranges into individual dates, but that solution is slow when the dataframe is large.
Any more efficient solution would be highly appreciated.
Use pandas.date_range in a list comprehension, then use DataFrame.explode (you need to be using at least pandas v0.25.0 for the explode method):
import pandas as pd

# Minimal example setup
df = pd.DataFrame({
    'TRIPNAME': ['HIGHSEASON', 'LOWSEASON', 'MEDSEASON'],
    'TRIPCAT': ['H', 'L', 'M'],
    'STARTDATE': ['01JAN20', '16SEP20', '29JAN20'],
    'ENDDATE': ['15JAN20', '30NOV20', '19JUL20'],
    'FARE': [345, 280, 250]
})

# Build one daily DatetimeIndex per row, covering STARTDATE..ENDDATE inclusive
df['DATE'] = [pd.date_range(s, e, freq='D') for s, e in
              zip(pd.to_datetime(df['STARTDATE']),
                  pd.to_datetime(df['ENDDATE']))]

# Give each date its own row, then drop the original range columns
df = df.explode('DATE').drop(['STARTDATE', 'ENDDATE'], axis=1)

print(df)
[out]
      TRIPNAME TRIPCAT  FARE       DATE
0   HIGHSEASON       H   345 2020-01-01
0   HIGHSEASON       H   345 2020-01-02
0   HIGHSEASON       H   345 2020-01-03
0   HIGHSEASON       H   345 2020-01-04
0   HIGHSEASON       H   345 2020-01-05
..         ...     ...   ...        ...
2    MEDSEASON       M   250 2020-07-15
2    MEDSEASON       M   250 2020-07-16
2    MEDSEASON       M   250 2020-07-17
2    MEDSEASON       M   250 2020-07-18
2    MEDSEASON       M   250 2020-07-19

[264 rows x 4 columns]
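
If the list comprehension itself becomes the bottleneck on a very large dataframe, one possible alternative is to skip the per-row date_range calls and build the dates with array operations instead. This is only a rough sketch of a different technique (row repetition plus day offsets), not the method above. The variable names (start, end, n_days, offsets, expanded) and the explicit format='%d%b%y' are my own assumptions based on the sample strings, and it assumes every ENDDATE is on or after its STARTDATE. I have not benchmarked it against explode.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'TRIPNAME': ['HIGHSEASON', 'LOWSEASON', 'MEDSEASON'],
    'TRIPCAT': ['H', 'L', 'M'],
    'STARTDATE': ['01JAN20', '16SEP20', '29JAN20'],
    'ENDDATE': ['15JAN20', '30NOV20', '19JUL20'],
    'FARE': [345, 280, 250]
})

# Assumption: all dates look like '01JAN20', so an explicit format is used.
start = pd.to_datetime(df['STARTDATE'], format='%d%b%y')
end = pd.to_datetime(df['ENDDATE'], format='%d%b%y')

# Number of days each row covers, counting both endpoints.
n_days = ((end - start).dt.days + 1).to_numpy()

# Repeat every row once per covered day (positional, so the original
# index labels are carried over, as with explode).
row_pos = np.repeat(np.arange(len(df)), n_days)
expanded = df.iloc[row_pos].drop(columns=['STARTDATE', 'ENDDATE'])

# Day offset inside each original row: 0, 1, ..., n_days - 1.
first_pos = np.repeat(n_days.cumsum() - n_days, n_days)
offsets = np.arange(n_days.sum()) - first_pos

# Start date of the row plus its offset gives the individual date.
dates = pd.DatetimeIndex(start).repeat(n_days) + pd.to_timedelta(offsets, unit='D')
expanded['DATE'] = dates.to_numpy()

print(expanded)
```

A side effect of this variant is that DATE is created directly as a datetime column. Call reset_index(drop=True) on the result if you do not want the repeated index labels.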