I'm using hospital admissions data, and trying to tie together a table of observations:
dt_taken | patient_id | observation | value |
---|---|---|---|
2020-04-13 00:00:00 | patient01 | "Heart rate" | 69 |
... |
... with a table of admissions:
patient_id | admission_id | startdate | enddate |
---|---|---|---|
patient01 | admission01 | 2020-04-01 00:04:20 | 2020-05-01 00:23:59 |
... |
... such that it returns a list of observations tied to admissions, and rejects all those that were not made during an admission (eroneously recorded, or taken in outpatient visits etc):
dt_taken | admission_id | observation | value |
---|---|---|---|
2020-04-13 00:00:00 | admission01 | "Heart rate" | 69 |
... |
My relatively simplistic approach to-date has been to iterate over each patient, then each patient's admission, then each observation and assign it to that admission, but given I have >36k admissions and >1million observations this is incredibly time-consuming (and my government-issued laptop hates me for it).
Is there a more efficient way that I'm missing, either using {data.table} (I must confess to being an absolute newbie here, preferring to work in {tidyverse}) or even that I can run on the SQL server where the tables are stored to save my aging laptop?
For data.table
, this is mostly a dupe of How to perform join over date ranges using data.table?, though that doesn't provide the RHS[LHS, on=.(..)]
method.
observations
# dt_taken patient_id observation value
# 1 2020-04-13 00:00:00 patient01 Heart rate 69
admissions
# patient_id admission_id startdate enddate
# 1 patient01 admission01 2020-04-01 00:04:20 2020-05-01 00:23:59
### convert to data.table
setDT(observations)
setDT(admissions)
### we need proper 'POSIXt' objects
observations[, dt_taken := as.POSIXct(dt_taken)]
admissions[, (dates) := lapply(.SD, as.POSIXct), .SDcols = dates]
And the join.
admissions[observations, on = .(patient_id, startdate <= dt_taken, enddate >= dt_taken)]
# patient_id admission_id startdate enddate observation value
# <char> <char> <POSc> <POSc> <char> <int>
# 1: patient01 admission01 2020-04-13 2020-04-13 Heart rate 69
Two things that I believe are noteworthy:
in SQL (and similarly in other join-friendly languages), it is often shown as
select ...
from TABLE1 left join TABLE2 ...
suggesting that TABLE1
is the LHS (left-hand side) and TABLE2
is the RHS table. (This is a gross generalization, mostly gearing towards a left-join since that's all that data.table::[
supports; for inner/outer/full joins, you'll need merge(.)
or other external mechanisms. See How to join (merge) data frames (inner, outer, left, right) and https://stackoverflow.com/a/6188334/3358272 for more discussion on JOINs, etc.)
From this, data.table::[
's mechanism is effectively
TABLE2[TABLE1, on = .(...)]
RHS[LHS, on = .(...)]
(Meaning that the right-hand-side table is actually the first table from left-to-right ...)
The names in the output of inequi-joins are preserved from the RHS, see that dt_taken
is not found. However, the values of those startdate
and enddate
columns are from dt_taken
.
Because of this, I've often found the simplest way for me to wrap my brain around the renaming and values and such is when I'm not certain, I copy a join column into a new column and join using that column, then delete it post-merge. It's sloppy and lazy, but I've caught myself too many times missing something and thinking it was not what I had thought.
This might be a little more direct if SQL seems more intuitive.
sqldf::sqldf(
"select ob.*, ad.admission_id
from observations ob
left join admissions ad on ob.patient_id=ad.patient_id
and ob.dt_taken between ad.startdate and ad.enddate")
# dt_taken patient_id observation value admission_id
# 1 2020-04-13 patient01 Heart rate 69 admission01
Data (already data.table
with POSIXt
, works just as well with sqldf
though regular data.frame
s will work just fine, too):
admissions <- setDT(structure(list(patient_id = "patient01", admission_id = "admission01", startdate = structure(1585713860, class = c("POSIXct", "POSIXt" ), tzone = ""), enddate = structure(1588307039, class = c("POSIXct", "POSIXt"), tzone = "")), class = c("data.table", "data.frame"), row.names = c(NA, -1L)))
observations <- setDT(structure(list(dt_taken = structure(1586750400, class = c("POSIXct", "POSIXt"), tzone = ""), patient_id = "patient01", observation = "Heart rate", value = 69L), class = c("data.table", "data.frame"), row.names = c(NA, -1L)))
(I use setDT
to repair the fact that we can't pass the .internal.selfref
attribute here.)