I want to filter the datapoints I have, until I only have the datapoints were the participant was asleep left. I have my dataframe with DateTime values and the values I am researching, and a different dataframe that has when the participant started sleeping, and when they ended sleeping. Instead of having to write out every start and stop time in a boolean mask, I wanted to know if there is a way to do this by iterating over either the big dataframe or the dataframe with the start and end sleeping times, or any other better way than having to manually enter 175 nights.
The start/stop dataframe looks like this, I have one for every participant:
df_sleep1:
date start stop
5/30/2023 5/29/2023 22:15 5/30/2023 7:22
5/31/2023 5/30/2023 23:19 5/31/2023 6:46
6/1/2023 6/1/2023 0:02 6/1/2023 8:31
The dataframe with all the data looks like this, where I want to add an "asleep" column:
df:
DateTime HeartRate Participant Asleep
0 2023-05-29 23:44:00 76.0 1
1 2023-05-30 06:44:00 76.0 1
2 2023-05-30 20:45:00 84.0 1
3 2023-05-31 04:45:00 84.0 2
4 2023-06-1 20:46:00 81.0 2
What I have tried:
dt = df['DateTime'].to_numpy()
start1 = df_sleep1['Start'].to_numpy()[:, None]
end1 = df_sleep1['Stop'].to_numpy()[:, None]
mask1 = ((start1 <= dt) & (dt <= end1) & (df['Participant'] == 1))
df['Sleep'] = mask1.any(axis=0)
def sleepFunction(row):
if (df_sleep1['Start'] <= dt) & (dt <= df_sleep1['Stop']) & (df['Participant'] == 1):
return True
else:
return False
df['sleepState'] = df.apply(lambda row: sleepFunction(row), axis = 1)
Both give similar errors about the shapes of the dataframes/arrays not matching up, which is not something I want to do anyways.
If the intervals are non-overlapping, an efficient method would be to use a merge_asof
: merge on the starting date by participant, then ensure that the date is after the end.
# map the Participant ID to the df_sleep DataFrame
all_sleep = pd.concat({1: df_sleep1}, names=['Participant']).reset_index(level=0)
# ensure having datetime types
all_sleep[['start', 'stop']] = all_sleep[['start', 'stop']].apply(pd.to_datetime)
df['DateTime'] = pd.to_datetime(df['DateTime'])
# merge by date and participant
df['Asleep'] = (
pd.merge_asof(df.sort_values(by='DateTime').reset_index(),
all_sleep.sort_values(by='start'),
left_on='DateTime', right_on='start',
by='Participant'
)
.assign(Asleep=lambda d: d['DateTime'].le(d['stop']))
.set_index('index')['Asleep']
)
Output:
DateTime HeartRate Participant Asleep
0 2023-05-29 23:44:00 76.0 1 True
1 2023-05-30 06:44:00 76.0 1 True
2 2023-05-30 20:45:00 84.0 1 False
3 2023-05-31 04:45:00 84.0 2 False
4 2023-06-01 20:46:00 81.0 2 False