Search code examples
pandasdatetimejupyter-notebookboolean-expression

Can I use a boolean mask to find if a DateTime value falls between two other DateTime values in a different dataframe


I want to filter the datapoints I have, until I only have the datapoints were the participant was asleep left. I have my dataframe with DateTime values and the values I am researching, and a different dataframe that has when the participant started sleeping, and when they ended sleeping. Instead of having to write out every start and stop time in a boolean mask, I wanted to know if there is a way to do this by iterating over either the big dataframe or the dataframe with the start and end sleeping times, or any other better way than having to manually enter 175 nights.

The start/stop dataframe looks like this, I have one for every participant:

df_sleep1:

date            start       stop
5/30/2023   5/29/2023 22:15 5/30/2023 7:22
5/31/2023   5/30/2023 23:19 5/31/2023 6:46
6/1/2023    6/1/2023 0:02   6/1/2023 8:31

The dataframe with all the data looks like this, where I want to add an "asleep" column:

df:

        DateTime            HeartRate        Participant      Asleep
0   2023-05-29 23:44:00 76.0             1
1   2023-05-30 06:44:00 76.0             1
2   2023-05-30 20:45:00 84.0             1
3   2023-05-31 04:45:00 84.0             2
4   2023-06-1 20:46:00  81.0             2

What I have tried:

dt = df['DateTime'].to_numpy()

start1 = df_sleep1['Start'].to_numpy()[:, None]
end1 = df_sleep1['Stop'].to_numpy()[:, None]
    
mask1 = ((start1 <= dt) & (dt <= end1) & (df['Participant'] == 1))
df['Sleep'] = mask1.any(axis=0)
def sleepFunction(row):
    if (df_sleep1['Start'] <= dt) & (dt <= df_sleep1['Stop']) & (df['Participant'] == 1):
        return True
    else:
        return False

df['sleepState'] = df.apply(lambda row: sleepFunction(row), axis = 1)

Both give similar errors about the shapes of the dataframes/arrays not matching up, which is not something I want to do anyways.


Solution

  • If the intervals are non-overlapping, an efficient method would be to use a merge_asof: merge on the starting date by participant, then ensure that the date is after the end.

    # map the Participant ID to the df_sleep DataFrame
    all_sleep = pd.concat({1: df_sleep1}, names=['Participant']).reset_index(level=0)
    
    # ensure having datetime types
    all_sleep[['start', 'stop']] = all_sleep[['start', 'stop']].apply(pd.to_datetime)
    df['DateTime'] = pd.to_datetime(df['DateTime'])
    
    # merge by date and participant
    df['Asleep'] = (
     pd.merge_asof(df.sort_values(by='DateTime').reset_index(),
                   all_sleep.sort_values(by='start'),
                   left_on='DateTime', right_on='start',
                   by='Participant'
                  )
       .assign(Asleep=lambda d: d['DateTime'].le(d['stop']))
       .set_index('index')['Asleep']
    )
    

    Output:

                 DateTime  HeartRate  Participant  Asleep
    0 2023-05-29 23:44:00       76.0            1    True
    1 2023-05-30 06:44:00       76.0            1    True
    2 2023-05-30 20:45:00       84.0            1   False
    3 2023-05-31 04:45:00       84.0            2   False
    4 2023-06-01 20:46:00       81.0            2   False