Search code examples
pythonpandaslag

How can I convert this SQL code to equivalent pandas code involving a lag function?


I have a pandas dataframe that contains a patient's ID and hospital admit time. I want to filter out the rows where a patient's admission occurs within 30 days of the previous admission (but include the first admission). With SQL I was able to do this using the lag function:

case
  -- mark the first hospital adm 
      when dense_rank() over (partition by adm.subject_id order by adm.admittime) = 1 then true
  -- mark subsequent hospital adms if its been atleast a month since previous admission.
      when round((cast(extract(epoch from adm.admittime - lag(adm.admittime, 1) over (partition by
        adm.subject_id order by adm.admittime))/(60*60*24) as numeric)), 2) >= 30.0 then true
      else false end as include_adm

How do I do this with pandas? Essentially, I want to filter out one row of patient ID 30 from the following dataframe:

id  admit_time
0   30  2018-10-03
1   30  2018-10-29
2   13  2017-11-01
3   13  2018-02-27

to get

id  admit_time
0   30  2018-10-03
1   13  2017-11-01
2   13  2018-02-27

as the patient's second admission falls within 30 days of the first admission. But both the admissions of patient 13 are kept as the difference between the admission dates is more than 30 days.

What I've shown above is an example dataframe. The real dataframe consists of more columns and rows. More specifically, one of the column is a clinical note for that patient taken at a specified time. Thus, there are a lot of duplicate information in the rows expect for the clinical note. For example, the above dataframe:

id  admit_time  note
0   30  2018-10-03  note_content1
1   30  2018-10-03  note_content2
2   30  2018-10-29  note_content1
3   30  2018-10-29  note_content2
4   13  2017-11-01  note_content1
5   13  2018-02-27  note_content2
6   13  2018-02-27  note_content2

should result in the following dataframe after filteration:

id  admit_time  note
0   30  2018-10-03  note_content1
1   30  2018-10-03  note_content2
2   13  2017-11-01  note_content1
3   13  2018-02-27  note_content1
4   13  2018-02-27  note_content2

Solution

  • We use diff to calculate the difference of each admit_time per id group on the sorted dataframe and select any rows with NaT difference (i.e. first row per group) or difference > 30 days. Finally we remove the auxiliary column delta:

    df['delta'] = df.sort_values(['id', 'admit_time']).groupby('id')['admit_time'].transform(lambda x: x.diff())
    df = df[df.delta.isna() | (df.delta >= pd.Timedelta(days=30))].drop(columns='delta')
    

    Output:

       id admit_time
    0  30 2018-10-03
    2  13 2017-11-01
    3  13 2018-02-27
    


    Update for the amended question:

    Group by ['id','note'] instead of just 'id':

    df['delta'] = df.sort_values(['id', 'admit_time']).groupby(['id','note'])['admit_time'].transform(lambda x: x.diff())
    df = df[df.delta.isna() | (df.delta >= pd.Timedelta(days=30))].drop(columns='delta')
    

    Result:

       id admit_time           note
    0  30 2018-10-03  note_content1
    1  30 2018-10-03  note_content2
    4  13 2017-11-01  note_content1
    5  13 2018-02-27  note_content1
    6  13 2018-02-27  note_content2