I have cleaned up a data set to get it into this format. The assigned_pat_loc
represents a room number, so I am trying to identify when two different patients (patient_id
) are in the same room at the same time; i.e., overlapping start_time
and end_time
between rows with the same assigned_pat_loc
but different patient_id
's. The start_time
and end_time
represent the times that that particular patient was in that room. So if those times are overlapping between two patients in the same room, it means that they shared the room together. This is what I'm ultimately looking for. Here is the base data set from which I want to construct these changes:
patient_id assigned_pat_loc start_time end_time
0 19035648 SICU^6108 2009-01-10 18:27:48 2009-02-25 15:45:54
1 19039244 85^8520 2009-01-02 06:27:25 2009-01-05 10:38:41
2 19039507 55^5514 2009-01-01 13:25:45 2009-01-01 13:25:45
3 19039555 EIAB^EIAB 2009-01-15 01:56:48 2009-02-23 11:36:34
4 19039559 EIAB^EIAB 2009-01-16 11:24:18 2009-01-19 18:41:33
... ... ... ... ...
140906 46851413 EIAB^EIAB 2011-12-31 22:28:38 2011-12-31 23:15:49
140907 46851422 EIAB^EIAB 2011-12-31 21:52:44 2011-12-31 22:50:08
140908 46851430 4LD^4LDX 2011-12-31 22:41:10 2011-12-31 22:44:48
140909 46851434 EIC^EIC 2011-12-31 23:45:22 2011-12-31 23:45:22
140910 46851437 EIAB^EIAB 2011-12-31 22:54:40 2011-12-31 23:30:10
I am thinking I should approach this with a groupby of some sort, but I'm not sure exactly how to implement. I would show an attempt but it took me about 6 hours to even get to this point so I would appreciate even just some thoughts.
EDIT
Example of original data:
id Date Time assigned_pat_loc prior_pat_loc Activity
1 May/31/11 8:00 EIAB^EIAB^6 Admission
1 May/31/11 9:00 8w^201 EIAB^EIAB^6 Transfer
1 Jun/8/11 15:00 8w^201 Discharge
2 May/31/11 5:00 EIAB^EIAB^4 Admission
2 May/31/11 7:00 10E^45 EIAB^EIAB^4 Transfer
2 Jun/1/11 1:00 8w^201 10E^45 Transfer
2 Jun/1/11 8:00 8w^201 Discharge
3 May/31/11 9:00 EIAB^EIAB^2 Admission
3 Jun/1/11 9:00 8w^201 EIAB^EIAB^2 Transfer
3 Jun/5/11 9:00 8w^201 Discharge
4 May/31/11 9:00 EIAB^EIAB^9 Admission
4 May/31/11 7:00 10E^45 EIAB^EIAB^9 Transfer
4 Jun/1/11 8:00 10E^45 Death
Example of desired output:
id r_id start_date start_time end_date end_time length location
1 2 Jun/1/11 1:00 Jun/1/11 8:00 7 8w^201
1 3 Jun/1/11 9:00 Jun/5/11 9:00 96 8w^201
2 4 May/31/11 7:00 Jun/1/11 1:00 18 10E^45
2 1 Jun/1/11 1:00 Jun/1/11 8:00 7 8w^201
3 1 Jun/1/11 9:00 Jun/5/11 9:00 96 8w^201
Where r_id
is the "other" patient who is sharing the same room as another one, and length
is the number of time in hours that the room was shared.
In this example:
Another option.
I'm starting wiht the original data after the EDIT, but I have changed this row
4 May/31/11 9:00 EIAB^EIAB^9 Admission
to
4 May/31/11 6:00 EIAB^EIAB^9 Admission
because I think the admission time should be before the transfer time?
The first step is essentially to get a dataframe similiar to the one you're starting out with:
df = (
df.assign(start_time=pd.to_datetime((df["Date"] + " " + df["Time"])))
.sort_values(["id", "start_time"])
.assign(duration=lambda df: -df.groupby("id")["start_time"].diff(-1))
.loc[lambda df: df["duration"].notna()]
.assign(end_time=lambda df: df["start_time"] + df["duration"])
.rename(columns={"assigned_pat_loc": "location"})
[["id", "location", "start_time", "end_time"]]
)
Result for the sample:
id location start_time end_time
0 1 EIAB^EIAB^6 2011-05-31 08:00:00 2011-05-31 09:00:00
1 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00
3 2 EIAB^EIAB^4 2011-05-31 05:00:00 2011-05-31 07:00:00
4 2 10E^45 2011-05-31 07:00:00 2011-06-01 01:00:00
5 2 8w^201 2011-06-01 01:00:00 2011-06-01 08:00:00
7 3 EIAB^EIAB^2 2011-05-31 09:00:00 2011-06-01 09:00:00
8 3 8w^201 2011-06-01 09:00:00 2011-06-05 09:00:00
10 4 EIAB^EIAB^9 2011-05-31 06:00:00 2011-05-31 07:00:00
11 4 10E^45 2011-05-31 07:00:00 2011-06-01 08:00:00
The next step is merging df
with itself on the location
column and eliminating the rows where id
is the same as r_id
:
df = (
df.merge(df, on="location")
.rename(columns={"id_x": "id", "id_y": "r_id"})
.loc[lambda df: df["id"] != df["r_id"]]
)
Then finally get the rows with an actual overlap via m
, calculate the duration of the overlap, and bring the dataframe in the form you are looking for:
m = (
(df["start_time_x"].le(df["start_time_y"])
& df["start_time_y"].le(df["end_time_x"]))
| (df["start_time_y"].le(df["start_time_x"])
& df["start_time_x"].le(df["end_time_y"]))
)
df = (
df[m]
.assign(
start_time=lambda df: df[["start_time_x", "start_time_y"]].max(axis=1),
end_time=lambda df: df[["end_time_x", "end_time_y"]].min(axis=1),
duration=lambda df: df["end_time"] - df["start_time"]
)
.assign(
start_date=lambda df: df["start_time"].dt.date,
start_time=lambda df: df["start_time"].dt.time,
end_date=lambda df: df["end_time"].dt.date,
end_time=lambda df: df["end_time"].dt.time
)
[[
"id", "r_id",
"start_date", "start_time", "end_date", "end_time",
"duration", "location"
]]
.sort_values(["id", "r_id"]).reset_index(drop=True)
)
Result for the sample:
id r_id start_date start_time end_date end_time duration \
0 1 2 2011-06-01 01:00:00 2011-06-01 08:00:00 0 days 07:00:00
1 1 3 2011-06-01 09:00:00 2011-06-05 09:00:00 4 days 00:00:00
2 2 1 2011-06-01 01:00:00 2011-06-01 08:00:00 0 days 07:00:00
3 2 4 2011-05-31 07:00:00 2011-06-01 01:00:00 0 days 18:00:00
4 3 1 2011-06-01 09:00:00 2011-06-05 09:00:00 4 days 00:00:00
5 4 2 2011-05-31 07:00:00 2011-06-01 01:00:00 0 days 18:00:00
location
0 8w^201
1 8w^201
2 8w^201
3 10E^45
4 8w^201
5 10E^45