
Filter DataFrame events not in time windows DataFrame


I have a DataFrame of events (Event Name - Time) and a DataFrame of time windows (Start Time - End Time). I want to get a DataFrame containing only the events not in any of the time windows. I am looking for a "pythonic" way to filter the DataFrame.

Example: Events DataFrame:

Event Name  Event Time
Event1      02/01/2000 00:00:00
Event2      05/01/2000 10:00:00
Event3      07/01/2000 09:00:00
Event4      10/01/2000 02:00:00

Time Windows DataFrame:

Time Window Name  Start Time           End Time
Window1           01/01/2000 00:00:00  06/01/2000 00:00:00
Window2           10/01/2000 01:00:00  10/01/2000 04:00:00

Result: Filtered Events DataFrame:

Event Name  Event Time
Event3      07/01/2000 09:00:00

Setup:

import pandas as pd

events_data = {
    'Event Name': ['Event1', 'Event2', 'Event3', 'Event4'],
    'Event Time': ['02/01/2000 00:00:00', '05/01/2000 10:00:00', '07/01/2000 09:00:00', '10/01/2000 02:00:00']
}

time_windows_data = {
    'Time Window Name': ['Window1', 'Window2'],
    'Start Time': ['01/01/2000 00:00:00', '10/01/2000 01:00:00'],
    'End Time': ['06/01/2000 00:00:00', '10/01/2000 04:00:00']
}

events_df = pd.DataFrame(events_data)
time_windows_df = pd.DataFrame(time_windows_data)

events_df['Event Time'] = pd.to_datetime(events_df['Event Time'], format='%d/%m/%Y %H:%M:%S')
time_windows_df['Start Time'] = pd.to_datetime(time_windows_df['Start Time'], format='%d/%m/%Y %H:%M:%S')
time_windows_df['End Time'] = pd.to_datetime(time_windows_df['End Time'], format='%d/%m/%Y %H:%M:%S')

Solution

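  • You can build an IntervalIndex from the time windows, then create a boolean mask with reindex (df_events and df_time are defined under "Reproducible inputs"; they correspond to events_df and time_windows_df in the question):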

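    # build an IntervalIndex from the window bounds (right-closed by default)
    idx = pd.IntervalIndex.from_arrays(df_time['Start Time'], df_time['End Time'])
    
    # build boolean mask: event times inside a window pick up False,
    # event times that match no window get fill_value=True
    m = (pd.Series(False, index=idx)
           .reindex(df_events['Event Time'], fill_value=True)
           .to_numpy()
        )
    
    # keep only the events that fall outside every window
    out = df_events[m]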
    

    Alternative to build m:

    # Index.reindex returns (new_index, indexer); the indexer is -1 where no interval matches
    m = idx.reindex(df_events['Event Time'])[1] == -1
    

    Output:

      Event Name          Event Time
    2     Event3 2000-01-07 09:00:00
    

    Intermediates:

    # idx
    IntervalIndex([(2000-01-01 00:00:00, 2000-01-06 00:00:00],
                   (2000-01-10 01:00:00, 2000-01-10 04:00:00]],
                  dtype='interval[datetime64[ns], right]')
    
    # m
    array([False, False,  True, False])
    
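A note on boundaries: the `(… ]` notation in the idx output means the intervals are right-closed, which is the pandas default. An event that falls exactly on a window's Start Time is therefore treated as outside that window. If both endpoints should count as inside, `closed='both'` can be passed to `from_arrays`. The sketch below is only an illustration: the names `start`, `end` and `probe` are made up for this example, and it uses `IntervalIndex.get_indexer`, which returns -1 where no interval contains the value.

```python
import pandas as pd

# one window, and two probe times sitting exactly on its bounds (illustrative data)
start = pd.to_datetime(['2000-01-01 00:00:00'])
end = pd.to_datetime(['2000-01-06 00:00:00'])
probe = pd.to_datetime(['2000-01-01 00:00:00', '2000-01-06 00:00:00'])

# default: right-closed intervals, i.e. (start, end]
right = pd.IntervalIndex.from_arrays(start, end)
# both endpoints included, i.e. [start, end]
both = pd.IntervalIndex.from_arrays(start, end, closed='both')

# get_indexer gives the position of the containing interval, or -1 if there is none
outside_right = right.get_indexer(probe) == -1
outside_both = both.get_indexer(probe) == -1

print(outside_right)  # the start bound counts as outside, the end bound as inside
print(outside_both)   # both bounds count as inside
```

Which convention is right depends on how the windows are defined in your data; the expected result in the question is the same either way.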

    Reproducible inputs:

    import pandas as pd
    from pandas import Timestamp
    
    df_events = pd.DataFrame({'Event Name': ['Event1', 'Event2', 'Event3', 'Event4'],
                              'Event Time': [Timestamp('2000-01-02 00:00:00'),
                                             Timestamp('2000-01-05 10:00:00'),
                                             Timestamp('2000-01-07 09:00:00'),
                                             Timestamp('2000-01-10 02:00:00')]})
    df_time = pd.DataFrame({'Time Window Name': ['Window1', 'Window2'],
                            'Start Time': [Timestamp('2000-01-01 00:00:00'), Timestamp('2000-01-10 01:00:00')],
                            'End Time': [Timestamp('2000-01-06 00:00:00'), Timestamp('2000-01-10 04:00:00')]})
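One caveat worth hedging: the IntervalIndex lookup above relies on the windows not overlapping. As far as I know, `reindex` and `get_indexer` raise an error on an overlapping IntervalIndex. If your windows can overlap, a plain NumPy broadcasting comparison is one possible fallback. The following is a minimal sketch under that assumption, not part of the answer above: the helper name `events_outside_windows` is hypothetical, the third window is invented to create an overlap, and both window bounds are treated as inclusive.

```python
import numpy as np
import pandas as pd

def events_outside_windows(events, windows, time_col='Event Time',
                           start_col='Start Time', end_col='End Time'):
    """Return the rows of `events` whose time lies in none of the windows.

    Hypothetical helper; window bounds are treated as inclusive.
    """
    t = events[time_col].to_numpy()[:, None]   # shape (n_events, 1)
    start = windows[start_col].to_numpy()      # shape (n_windows,)
    end = windows[end_col].to_numpy()
    # broadcast to an (n_events, n_windows) boolean matrix, then reduce per event
    inside = ((t >= start) & (t <= end)).any(axis=1)
    return events[~inside]

df_events = pd.DataFrame({
    'Event Name': ['Event1', 'Event2', 'Event3', 'Event4'],
    'Event Time': pd.to_datetime(['2000-01-02 00:00:00', '2000-01-05 10:00:00',
                                  '2000-01-07 09:00:00', '2000-01-10 02:00:00']),
})
# Window3 is made up for this example and overlaps Window1
df_time = pd.DataFrame({
    'Time Window Name': ['Window1', 'Window2', 'Window3'],
    'Start Time': pd.to_datetime(['2000-01-01 00:00:00', '2000-01-10 01:00:00',
                                  '2000-01-05 00:00:00']),
    'End Time': pd.to_datetime(['2000-01-06 00:00:00', '2000-01-10 04:00:00',
                                '2000-01-06 12:00:00']),
})

out = events_outside_windows(df_events, df_time)
print(out)
```

The trade-off is memory: the intermediate matrix has one cell per (event, window) pair, so this suits moderate sizes; for non-overlapping windows the IntervalIndex approach is likely the better fit.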