I have a pandas dataframe where each row corresponds to a period of time for a given record. If a record has more than one period of time there is a gap between them. I would like to fill in all the missing time periods that are between the end of the first time period and the start of the final time period.
My data looks like this:
record = [1, 1, 2, 2, 2]
start_time = pd.to_datetime(['2001-01-01', '2001-02-01', '2000-01-01', '2001-05-31', '2001-09-01'])
stop_time = pd.to_datetime(['2001-01-15', '2001-02-28', '2001-01-31', '2001-08-16', '2001-09-30'])
df = pd.DataFrame({'record': record, 'start_time': start_time, 'stop_time': stop_time})
record start_time stop_time
0 1 2001-01-01 2001-01-15
1 1 2001-02-01 2001-02-28
2 2 2000-01-01 2001-01-31
3 2 2001-05-31 2001-08-16
4 2 2001-09-01 2001-09-30
The gaps in time are between lines 0 and 1 (the stop time is 2001-01-15 and the next start time is 2001-02-01, which is a 16 day gap), as well as 2 and 3, and 3 and 4. Gaps can only happen between the first and last row for a given record.
What I'm trying to achieve is this:
record start_time stop_time
0 1 2001-01-01 2001-01-15
1 1 2001-01-16 2001-01-31
2 1 2001-02-01 2001-02-28
3 2 2000-01-01 2001-01-31
4 2 2001-02-01 2001-05-30
5 2 2001-05-31 2001-08-16
6 2 2001-08-17 2001-08-31
7 2 2001-09-01 2001-09-30
That is, I want to add in rows that have start and stop times that fit those gaps. So in the previous example there would be a new row for record 1 with a start date of 2001-01-16 and an end date of 2001-01-31.
The full dataset has over 2M rows across 1.5M records, so I'm looking for a vectorized solution in pandas that doesn't use apply and is relatively efficient.
Maybe something like this?
import pandas as pd
record = [1, 1, 2, 2, 2]
start_time = pd.to_datetime(['2001-01-01', '2001-02-01', '2000-01-01', '2001-05-31', '2001-09-01'])
stop_time = pd.to_datetime(['2001-01-15', '2001-02-28', '2001-01-31', '2001-08-16', '2001-09-30'])
df = pd.DataFrame({'record': record, 'start_time': start_time, 'stop_time': stop_time})
one_day = pd.Timedelta('1d')
missing_dates = []
for record, df_per_record in df.groupby('record'):
start_time = pd.to_datetime(df_per_record.start_time)
stop_time = pd.to_datetime(df_per_record.stop_time)
reference_date = pd.Timestamp(df_per_record.start_time.iloc[0])
start_time_in_days = (start_time - reference_date) // one_day
stop_time_in_days = (stop_time - reference_date) // one_day
dates_diff = start_time_in_days.iloc[1:].values - stop_time_in_days.iloc[:-1].values
missing_start_dates = stop_time[:-1][dates_diff > 1] + one_day
missing_stop_dates = missing_start_dates + ((dates_diff-2) * one_day)
missing_dates.append(pd.DataFrame({"record": record, "start_time": missing_start_dates, "stop_time": missing_stop_dates}))
print(pd.concat([df]+missing_dates).sort_values(["record", "start_time"]))
Edit:
version #2 this time without the for loop:
import pandas as pd
record = [1, 1, 2, 2, 2]
start_time = pd.to_datetime(['2001-01-01', '2001-02-01', '2000-01-01', '2001-05-31', '2001-09-01'])
stop_time = pd.to_datetime(['2001-01-15', '2001-02-28', '2001-01-31', '2001-08-16', '2001-09-30'])
df = pd.DataFrame({'record': record, 'start_time': start_time, 'stop_time': stop_time})
one_day = pd.Timedelta('1d')
start_time = pd.to_datetime(df.start_time)
stop_time = pd.to_datetime(df.stop_time)
reference_date = pd.Timestamp(df.start_time.iloc[0])
start_time_in_days = (start_time - reference_date) // one_day
stop_time_in_days = (stop_time - reference_date) // one_day
is_same_record = df.record.iloc[1:].values == df.record.iloc[:-1].values
dates_diff = start_time_in_days.iloc[1:].values - stop_time_in_days.iloc[:-1].values
mask = (dates_diff > 1) & is_same_record
missing_start_dates = stop_time[:-1][mask] + one_day
missing_stop_dates = missing_start_dates + ((dates_diff[is_same_record]-2) * one_day)
missing_dates = pd.DataFrame({"record": df.record.iloc[:-1][mask], "start_time": missing_start_dates, "stop_time": missing_stop_dates})
print(pd.concat([df, missing_dates]).sort_values(["record", "start_time"]).reset_index(drop=True))