I have a pandas dataframe that contains a patient's ID and hospital admit time. I want to filter out the rows where a patient's admission occurs within 30 days of the previous admission (but include the first admission). With SQL I was able to do this using the lag function:
case
    -- mark the first hospital admission
    when dense_rank() over (partition by adm.subject_id order by adm.admittime) = 1
        then true
    -- mark subsequent hospital admissions if it's been at least 30 days since the previous admission
    when round(cast(extract(epoch from adm.admittime - lag(adm.admittime, 1) over (
            partition by adm.subject_id order by adm.admittime)) / (60*60*24) as numeric), 2) >= 30.0
        then true
    else false
end as include_adm
How do I do this with pandas? Essentially, I want to filter out one row of patient ID 30 from the following dataframe:
id admit_time
0 30 2018-10-03
1 30 2018-10-29
2 13 2017-11-01
3 13 2018-02-27
to get
id admit_time
0 30 2018-10-03
1 13 2017-11-01
2 13 2018-02-27
as patient 30's second admission falls within 30 days of the first. Both admissions of patient 13 are kept, since the difference between their admission dates is more than 30 days.
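For reproducibility, the example dataframe above can be built with something like this (a minimal sketch, with admit_time parsed as a datetime as it is in my real data):

import pandas as pd

df = pd.DataFrame({
    'id': [30, 30, 13, 13],
    'admit_time': pd.to_datetime(['2018-10-03', '2018-10-29',
                                  '2017-11-01', '2018-02-27']),
})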
What I've shown above is an example dataframe. The real dataframe consists of more columns and rows. More specifically, one of the columns is a clinical note for the patient, taken at a specified time, so the rows contain a lot of duplicate information except for the clinical note. For example, the above dataframe:
id admit_time note
0 30 2018-10-03 note_content1
1 30 2018-10-03 note_content2
2 30 2018-10-29 note_content1
3 30 2018-10-29 note_content2
4 13 2017-11-01 note_content1
5 13 2018-02-27 note_content1
6 13 2018-02-27 note_content2
should result in the following dataframe after filtering:
id admit_time note
0 30 2018-10-03 note_content1
1 30 2018-10-03 note_content2
2 13 2017-11-01 note_content1
3 13 2018-02-27 note_content1
4 13 2018-02-27 note_content2
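A sketch of this amended input frame (the one with the note column), again with admit_time as datetime:

import pandas as pd

df = pd.DataFrame({
    'id': [30, 30, 30, 30, 13, 13, 13],
    'admit_time': pd.to_datetime(['2018-10-03', '2018-10-03', '2018-10-29', '2018-10-29',
                                  '2017-11-01', '2018-02-27', '2018-02-27']),
    'note': ['note_content1', 'note_content2', 'note_content1', 'note_content2',
             'note_content1', 'note_content1', 'note_content2'],
})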
We use diff to calculate the difference between consecutive admit_time values per id group on the sorted dataframe, and select any rows with a NaT difference (i.e. the first row per group) or a difference of at least 30 days. Finally we remove the auxiliary column delta:
# time difference to the previous admission within each id (NaT for the first admission per id)
df['delta'] = df.sort_values(['id', 'admit_time']).groupby('id')['admit_time'].transform(lambda x: x.diff())
# keep first admissions (NaT) and admissions at least 30 days after the previous one
df = df[df.delta.isna() | (df.delta >= pd.Timedelta(days=30))].drop(columns='delta')
Output:
id admit_time
0 30 2018-10-03
2 13 2017-11-01
3 13 2018-02-27
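As a side note, the transform with a lambda isn't strictly required here; calling diff directly on the grouped column should produce the same delta, since assignment aligns on the index:

df['delta'] = df.sort_values(['id', 'admit_time']).groupby('id')['admit_time'].diff()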
Update for the amended question: group by ['id', 'note'] instead of just 'id':
df['delta'] = df.sort_values(['id', 'admit_time']).groupby(['id','note'])['admit_time'].transform(lambda x: x.diff())
df = df[df.delta.isna() | (df.delta >= pd.Timedelta(days=30))].drop(columns='delta')
Result:
id admit_time note
0 30 2018-10-03 note_content1
1 30 2018-10-03 note_content2
4 13 2017-11-01 note_content1
5 13 2018-02-27 note_content1
6 13 2018-02-27 note_content2
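Note that the filtered dataframe keeps the original index labels (0, 1, 4, 5, 6 above). If you want the contiguous 0, 1, 2, ... numbering shown in the question's expected output, reset the index afterwards:

df = df.reset_index(drop=True)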