Search code examples
pythonpandasnumpyreducemask

Given a list of lists of two timestamps, how to check if the index of a dataframe is inside those datetime range?


I have a list of lists of two elements with type pd.Timestamp, e.g.:

intervals = [
    [pd.Timestamp('2023-01-01 02:00:00'), pd.Timestamp('2023-01-01 03:00:00')], 
    [pd.Timestamp('2023-01-01 05:00:00'), pd.Timestamp('2023-01-01 07:00:00')],
    [pd.Timestamp('2023-01-01 07:30:00'), pd.Timestamp('2023-01-01 08:00:00')],
]

I also have a sample dataframe, e.g.:

data = {'value': [1, 2, 3, 4]}
index = [
    pd.Timestamp('2023-01-01 01:00:00'),
    pd.Timestamp('2023-01-01 02:00:00'), 
    pd.Timestamp('2023-01-01 03:00:00'),
    pd.Timestamp('2023-01-01 04:00:00'),
]
df = pd.DataFrame(data, index=index)

Now, how can I check if the index of the dataframe is inside all the datetime intervals? In other words, for every element of the index of the dataframe (a row) check if that row is contained in the first interval, or in the second interval, or in the third intervals, and so on by doing something like:

for start, end in intervals:
    df.index >= start & df.index >= end

My idea was to:

  1. Create a boolean mask for each interval that check if every element of the index is inside the interval
masks = [
   [False, True, True, False],
   [False, False, False, False],
   [False, False, False, False],
]
  1. Reduce the above list of masks into a single mask by applying the | (logical or) operator
mask = [False, True, True, False]

I wish to do 1 and 2 in a vectorized manner, so I would like to use numpy or pandas when possible. I am managing the data structure in Python with lists though at the moment. Which is the fastest way and which is the most elegant one also?


Solution

  • You can use an IntervalIndex with get_indexer_for to identify the existing matches, then all to aggregate:

    idx = pd.IntervalIndex.from_tuples(list(map(tuple, intervals)), closed='both')
    # or
    # idx = pd.IntervalIndex.from_arrays(*zip(*intervals), closed='both')
    
    out = (idx.get_indexer_for(df.index) != -1).all()
    

    Output: False

    How it works:

    idx.get_indexer_for(df.index)
    # array([-1,  0,  0, -1])
    
    # check if values are found
    idx.get_indexer_for(df.index) != -1
    # array([False, True,  True, False])
    
    # are all values found?
    (idx.get_indexer_for(df.index) != -1).all()
    # False
    

    checking which elements are inside a range:

    which = idx.get_indexer_for(df.index) != -1
    

    Output: array([False, True, True, False])

    all combinations

    If you need to compare all combinations, use :

    i = df.index.to_numpy()[:,None]
    idx = pd.IntervalIndex.from_arrays(*zip(*intervals), closed='both')
    
    out = (i >= idx.left) & (i <= idx.right) 
    

    Output:

    array([[False, False, False, False],
           [ True, False, False, False],
           [ True, False, False, False],
           [False, False, False, False]])