Search code examples
pythonpandasdataframegroup-byoverlap

Find overlapping time intervals based on condition in another column pandas


I have cleaned up a data set to get it into this format. The assigned_pat_loc represents a room number, so I am trying to identify when two different patients (patient_id) are in the same room at the same time; i.e., overlapping start_time and end_time between rows with the same assigned_pat_loc but different patient_id's. The start_time and end_time represent the times that that particular patient was in that room. So if those times are overlapping between two patients in the same room, it means that they shared the room together. This is what I'm ultimately looking for. Here is the base data set from which I want to construct these changes:

      patient_id    assigned_pat_loc    start_time          end_time
0     19035648      SICU^6108           2009-01-10 18:27:48 2009-02-25 15:45:54
1     19039244      85^8520             2009-01-02 06:27:25 2009-01-05 10:38:41
2     19039507      55^5514             2009-01-01 13:25:45 2009-01-01 13:25:45
3     19039555      EIAB^EIAB           2009-01-15 01:56:48 2009-02-23 11:36:34
4     19039559      EIAB^EIAB           2009-01-16 11:24:18 2009-01-19 18:41:33
... ... ... ... ...
140906 46851413     EIAB^EIAB           2011-12-31 22:28:38 2011-12-31 23:15:49
140907 46851422     EIAB^EIAB           2011-12-31 21:52:44 2011-12-31 22:50:08
140908 46851430     4LD^4LDX            2011-12-31 22:41:10 2011-12-31 22:44:48
140909 46851434     EIC^EIC             2011-12-31 23:45:22 2011-12-31 23:45:22
140910 46851437     EIAB^EIAB           2011-12-31 22:54:40 2011-12-31 23:30:10

I am thinking I should approach this with a groupby of some sort, but I'm not sure exactly how to implement. I would show an attempt but it took me about 6 hours to even get to this point so I would appreciate even just some thoughts.

EDIT

Example of original data:

id  Date    Time        assigned_pat_loc    prior_pat_loc   Activity
1   May/31/11   8:00    EIAB^EIAB^6                         Admission
1   May/31/11   9:00    8w^201              EIAB^EIAB^6     Transfer 
1   Jun/8/11    15:00   8w^201                              Discharge
2   May/31/11   5:00    EIAB^EIAB^4                         Admission 
2   May/31/11   7:00    10E^45              EIAB^EIAB^4     Transfer
2   Jun/1/11    1:00    8w^201              10E^45          Transfer
2   Jun/1/11    8:00    8w^201                              Discharge
3   May/31/11   9:00    EIAB^EIAB^2                         Admission
3   Jun/1/11    9:00    8w^201              EIAB^EIAB^2     Transfer
3   Jun/5/11    9:00    8w^201                              Discharge
4   May/31/11   9:00    EIAB^EIAB^9                         Admission
4   May/31/11   7:00    10E^45              EIAB^EIAB^9     Transfer
4   Jun/1/11    8:00    10E^45                              Death

Example of desired output:

id  r_id    start_date  start_time  end_date    end_time    length  location 
1   2       Jun/1/11    1:00        Jun/1/11    8:00        7   8w^201
1   3       Jun/1/11    9:00        Jun/5/11    9:00        96  8w^201
2   4       May/31/11   7:00        Jun/1/11    1:00        18  10E^45
2   1       Jun/1/11    1:00        Jun/1/11    8:00        7   8w^201
3   1       Jun/1/11    9:00        Jun/5/11    9:00        96  8w^201

Where r_id is the "other" patient who is sharing the same room as another one, and length is the number of time in hours that the room was shared.

In this example:

  1. r_id is the name of the variable you will generate for the id of the other patient.
  2. patient 1 had two room-sharing episodes, both in 8w^201 (room 201 of unit 8w); he shared the room with patient 2 for 7 hours (1 am to 8 am on June 1) and with patient 3 for 96 hours (9 am on June 1 to 9 am on June 5).
  3. Patient 2 also had two room sharing episodes. The first one was with patient 4 in in 10E^45 (room 45 of unit 10E) and lasted 18 hours (7 am May 31 to 1 am June 1); the second one is the 7-hour episode with patient 1 in 8w^201.
  4. Patient 3 had only one room-sharing episode with patient 1 in room 8w^201, lasting 96 hours.
  5. Patient 4, also, had only one room-sharing episode, with patient 2 in room 10E^45, lasting 18 hours.
  6. Note: room-sharing episodes are listed twice, once for each patient.

Solution

  • Another option.

    I'm starting wiht the original data after the EDIT, but I have changed this row

    4   May/31/11   9:00    EIAB^EIAB^9                         Admission
    

    to

    4   May/31/11   6:00    EIAB^EIAB^9                         Admission
    

    because I think the admission time should be before the transfer time?

    The first step is essentially to get a dataframe similiar to the one you're starting out with:

    df = (
        df.assign(start_time=pd.to_datetime((df["Date"] + " " + df["Time"])))
        .sort_values(["id", "start_time"])
        .assign(duration=lambda df: -df.groupby("id")["start_time"].diff(-1))
        .loc[lambda df: df["duration"].notna()]
        .assign(end_time=lambda df: df["start_time"] + df["duration"])
        .rename(columns={"assigned_pat_loc": "location"})
        [["id", "location", "start_time", "end_time"]]
    )
    

    Result for the sample:

        id     location          start_time            end_time
    0    1  EIAB^EIAB^6 2011-05-31 08:00:00 2011-05-31 09:00:00
    1    1       8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00
    3    2  EIAB^EIAB^4 2011-05-31 05:00:00 2011-05-31 07:00:00
    4    2       10E^45 2011-05-31 07:00:00 2011-06-01 01:00:00
    5    2       8w^201 2011-06-01 01:00:00 2011-06-01 08:00:00
    7    3  EIAB^EIAB^2 2011-05-31 09:00:00 2011-06-01 09:00:00
    8    3       8w^201 2011-06-01 09:00:00 2011-06-05 09:00:00
    10   4  EIAB^EIAB^9 2011-05-31 06:00:00 2011-05-31 07:00:00
    11   4       10E^45 2011-05-31 07:00:00 2011-06-01 08:00:00
    

    The next step is merging df with itself on the location column and eliminating the rows where id is the same as r_id:

    df = (
        df.merge(df, on="location")
        .rename(columns={"id_x": "id", "id_y": "r_id"})
        .loc[lambda df: df["id"] != df["r_id"]]
    )
    

    Then finally get the rows with an actual overlap via m, calculate the duration of the overlap, and bring the dataframe in the form you are looking for:

    m = (
        (df["start_time_x"].le(df["start_time_y"])
         & df["start_time_y"].le(df["end_time_x"]))
        | (df["start_time_y"].le(df["start_time_x"])
           & df["start_time_x"].le(df["end_time_y"]))
    )
    df = (
        df[m]
        .assign(
            start_time=lambda df: df[["start_time_x", "start_time_y"]].max(axis=1),
            end_time=lambda df: df[["end_time_x", "end_time_y"]].min(axis=1),
            duration=lambda df: df["end_time"] - df["start_time"]
        )
        .assign(
            start_date=lambda df: df["start_time"].dt.date,
            start_time=lambda df: df["start_time"].dt.time,
            end_date=lambda df: df["end_time"].dt.date,
            end_time=lambda df: df["end_time"].dt.time
        )
        [[
            "id", "r_id",
            "start_date", "start_time", "end_date", "end_time",
            "duration", "location"
        ]]
        .sort_values(["id", "r_id"]).reset_index(drop=True)
    )
    

    Result for the sample:

       id  r_id  start_date start_time    end_date  end_time        duration  \
    0   1     2  2011-06-01   01:00:00  2011-06-01  08:00:00 0 days 07:00:00   
    1   1     3  2011-06-01   09:00:00  2011-06-05  09:00:00 4 days 00:00:00   
    2   2     1  2011-06-01   01:00:00  2011-06-01  08:00:00 0 days 07:00:00   
    3   2     4  2011-05-31   07:00:00  2011-06-01  01:00:00 0 days 18:00:00   
    4   3     1  2011-06-01   09:00:00  2011-06-05  09:00:00 4 days 00:00:00   
    5   4     2  2011-05-31   07:00:00  2011-06-01  01:00:00 0 days 18:00:00   
    
      location  
    0   8w^201  
    1   8w^201  
    2   8w^201  
    3   10E^45  
    4   8w^201  
    5   10E^45