Search code examples
pythonpandasdataframedatetimevectorization

Creating missing time ranges in pandas


I have a pandas dataframe where each row corresponds to a period of time for a given record. If a record has more than one period of time there is a gap between them. I would like to fill in all the missing time periods that are between the end of the first time period and the start of the final time period.

My data looks like this:

record = [1, 1, 2, 2, 2]
start_time = pd.to_datetime(['2001-01-01', '2001-02-01', '2000-01-01', '2001-05-31', '2001-09-01'])
stop_time = pd.to_datetime(['2001-01-15', '2001-02-28', '2001-01-31', '2001-08-16', '2001-09-30'])
df = pd.DataFrame({'record': record, 'start_time': start_time, 'stop_time': stop_time})

   record start_time  stop_time
0       1 2001-01-01 2001-01-15
1       1 2001-02-01 2001-02-28
2       2 2000-01-01 2001-01-31
3       2 2001-05-31 2001-08-16
4       2 2001-09-01 2001-09-30

The gaps in time are between lines 0 and 1 (the stop time is 2001-01-15 and the next start time is 2001-02-01, which is a 16 day gap), as well as 2 and 3, and 3 and 4. Gaps can only happen between the first and last row for a given record.

What I'm trying to achieve is this:

   record start_time  stop_time
0       1 2001-01-01 2001-01-15
1       1 2001-01-16 2001-01-31
2       1 2001-02-01 2001-02-28
3       2 2000-01-01 2001-01-31
4       2 2001-02-01 2001-05-30
5       2 2001-05-31 2001-08-16
6       2 2001-08-17 2001-08-31
7       2 2001-09-01 2001-09-30

That is, I want to add in rows that have start and stop times that fit those gaps. So in the previous example there would be a new row for record 1 with a start date of 2001-01-16 and an end date of 2001-01-31.

The full dataset has over 2M rows across 1.5M records, so I'm looking for a vectorized solution in pandas that doesn't use apply and is relatively efficient.


Solution

  • Maybe something like this?

    import pandas as pd
    
    record = [1, 1, 2, 2, 2]
    start_time = pd.to_datetime(['2001-01-01', '2001-02-01', '2000-01-01', '2001-05-31', '2001-09-01'])
    stop_time = pd.to_datetime(['2001-01-15', '2001-02-28', '2001-01-31', '2001-08-16', '2001-09-30'])
    df = pd.DataFrame({'record': record, 'start_time': start_time, 'stop_time': stop_time})
    
    one_day = pd.Timedelta('1d')
    
    missing_dates = []
    for record, df_per_record in df.groupby('record'):
        start_time = pd.to_datetime(df_per_record.start_time)
        stop_time = pd.to_datetime(df_per_record.stop_time)
    
        reference_date = pd.Timestamp(df_per_record.start_time.iloc[0])
        start_time_in_days = (start_time - reference_date) // one_day
        stop_time_in_days = (stop_time - reference_date) // one_day
    
        dates_diff = start_time_in_days.iloc[1:].values - stop_time_in_days.iloc[:-1].values
        missing_start_dates = stop_time[:-1][dates_diff > 1] + one_day
        missing_stop_dates = missing_start_dates + ((dates_diff-2) * one_day)
        missing_dates.append(pd.DataFrame({"record": record, "start_time": missing_start_dates, "stop_time": missing_stop_dates}))
    
    print(pd.concat([df]+missing_dates).sort_values(["record", "start_time"]))
    

    Edit:
    version #2 this time without the for loop:

    import pandas as pd
    
    record = [1, 1, 2, 2, 2]
    start_time = pd.to_datetime(['2001-01-01', '2001-02-01', '2000-01-01', '2001-05-31', '2001-09-01'])
    stop_time = pd.to_datetime(['2001-01-15', '2001-02-28', '2001-01-31', '2001-08-16', '2001-09-30'])
    df = pd.DataFrame({'record': record, 'start_time': start_time, 'stop_time': stop_time})
    
    one_day = pd.Timedelta('1d')
    
    start_time = pd.to_datetime(df.start_time)
    stop_time = pd.to_datetime(df.stop_time)
    
    reference_date = pd.Timestamp(df.start_time.iloc[0])
    start_time_in_days = (start_time - reference_date) // one_day
    stop_time_in_days = (stop_time - reference_date) // one_day
    
    is_same_record = df.record.iloc[1:].values == df.record.iloc[:-1].values
    dates_diff = start_time_in_days.iloc[1:].values - stop_time_in_days.iloc[:-1].values
    mask = (dates_diff > 1) & is_same_record
    missing_start_dates = stop_time[:-1][mask] + one_day
    missing_stop_dates = missing_start_dates + ((dates_diff[is_same_record]-2) * one_day)
    missing_dates = pd.DataFrame({"record": df.record.iloc[:-1][mask], "start_time": missing_start_dates, "stop_time": missing_stop_dates})
    
    print(pd.concat([df, missing_dates]).sort_values(["record", "start_time"]).reset_index(drop=True))