python pandas dataframe datetime vectorization

Creating missing time ranges in pandas

I have a pandas dataframe where each row corresponds to a period of time for a given record. If a record has more than one period of time there is a gap between them. I would like to fill in all the missing time periods that are between the end of the first time period and the start of the final time period.

My data looks like this:

record = [1, 1, 2, 2, 2]
start_time = pd.to_datetime(['2001-01-01', '2001-02-01', '2000-01-01', '2001-05-31', '2001-09-01'])
stop_time = pd.to_datetime(['2001-01-15', '2001-02-28', '2001-01-31', '2001-08-16', '2001-09-30'])
df = pd.DataFrame({'record': record, 'start_time': start_time, 'stop_time': stop_time})

   record start_time  stop_time
0       1 2001-01-01 2001-01-15
1       1 2001-02-01 2001-02-28
2       2 2000-01-01 2001-01-31
3       2 2001-05-31 2001-08-16
4       2 2001-09-01 2001-09-30

The gaps in time are between lines 0 and 1 (the stop time is 2001-01-15 and the next start time is 2001-02-01, which is a 16 day gap), as well as 2 and 3, and 3 and 4. Gaps can only happen between the first and last row for a given record.

What I'm trying to achieve is this:

   record start_time  stop_time
0       1 2001-01-01 2001-01-15
1       1 2001-01-16 2001-01-31
2       1 2001-02-01 2001-02-28
3       2 2000-01-01 2001-01-31
4       2 2001-02-01 2001-05-30
5       2 2001-05-31 2001-08-16
6       2 2001-08-17 2001-08-31
7       2 2001-09-01 2001-09-30

That is, I want to add in rows that have start and stop times that fit those gaps. So in the previous example there would be a new row for record 1 with a start date of 2001-01-16 and an end date of 2001-01-31.

The full dataset has over 2M rows across 1.5M records, so I'm looking for a vectorized solution in pandas that doesn't use apply and is relatively efficient.

Solution

Maybe something like this?

import pandas as pd

record = [1, 1, 2, 2, 2]
start_time = pd.to_datetime(['2001-01-01', '2001-02-01', '2000-01-01', '2001-05-31', '2001-09-01'])
stop_time = pd.to_datetime(['2001-01-15', '2001-02-28', '2001-01-31', '2001-08-16', '2001-09-30'])
df = pd.DataFrame({'record': record, 'start_time': start_time, 'stop_time': stop_time})

one_day = pd.Timedelta('1d')

missing_dates = []
for record, df_per_record in df.groupby('record'):
    start_time = pd.to_datetime(df_per_record.start_time)
    stop_time = pd.to_datetime(df_per_record.stop_time)

    reference_date = pd.Timestamp(df_per_record.start_time.iloc[0])
    start_time_in_days = (start_time - reference_date) // one_day
    stop_time_in_days = (stop_time - reference_date) // one_day

    dates_diff = start_time_in_days.iloc[1:].values - stop_time_in_days.iloc[:-1].values
    missing_start_dates = stop_time[:-1][dates_diff > 1] + one_day
    missing_stop_dates = missing_start_dates + ((dates_diff-2) * one_day)
    missing_dates.append(pd.DataFrame({"record": record, "start_time": missing_start_dates, "stop_time": missing_stop_dates}))

print(pd.concat([df]+missing_dates).sort_values(["record", "start_time"]))

Edit:
version #2 this time without the for loop:

import pandas as pd

record = [1, 1, 2, 2, 2]
start_time = pd.to_datetime(['2001-01-01', '2001-02-01', '2000-01-01', '2001-05-31', '2001-09-01'])
stop_time = pd.to_datetime(['2001-01-15', '2001-02-28', '2001-01-31', '2001-08-16', '2001-09-30'])
df = pd.DataFrame({'record': record, 'start_time': start_time, 'stop_time': stop_time})

one_day = pd.Timedelta('1d')

start_time = pd.to_datetime(df.start_time)
stop_time = pd.to_datetime(df.stop_time)

reference_date = pd.Timestamp(df.start_time.iloc[0])
start_time_in_days = (start_time - reference_date) // one_day
stop_time_in_days = (stop_time - reference_date) // one_day

is_same_record = df.record.iloc[1:].values == df.record.iloc[:-1].values
dates_diff = start_time_in_days.iloc[1:].values - stop_time_in_days.iloc[:-1].values
mask = (dates_diff > 1) & is_same_record
missing_start_dates = stop_time[:-1][mask] + one_day
missing_stop_dates = missing_start_dates + ((dates_diff[is_same_record]-2) * one_day)
missing_dates = pd.DataFrame({"record": df.record.iloc[:-1][mask], "start_time": missing_start_dates, "stop_time": missing_stop_dates})

print(pd.concat([df, missing_dates]).sort_values(["record", "start_time"]).reset_index(drop=True))