Search code examples
sqlrdatetimedata.tablemerging-data

Efficiently joining two data.tables in R (or SQL tables) on date ranges?


I'm using hospital admissions data, and trying to tie together a table of observations:

dt_taken patient_id observation value
2020-04-13 00:00:00 patient01 "Heart rate" 69
...

... with a table of admissions:

patient_id admission_id startdate enddate
patient01 admission01 2020-04-01 00:04:20 2020-05-01 00:23:59
...

... such that it returns a list of observations tied to admissions, and rejects all those that were not made during an admission (eroneously recorded, or taken in outpatient visits etc):

dt_taken admission_id observation value
2020-04-13 00:00:00 admission01 "Heart rate" 69
...

My relatively simplistic approach to-date has been to iterate over each patient, then each patient's admission, then each observation and assign it to that admission, but given I have >36k admissions and >1million observations this is incredibly time-consuming (and my government-issued laptop hates me for it).

Is there a more efficient way that I'm missing, either using {data.table} (I must confess to being an absolute newbie here, preferring to work in {tidyverse}) or even that I can run on the SQL server where the tables are stored to save my aging laptop?


Solution

  • data.table

    For data.table, this is mostly a dupe of How to perform join over date ranges using data.table?, though that doesn't provide the RHS[LHS, on=.(..)] method.

    observations
    #              dt_taken patient_id observation value
    # 1 2020-04-13 00:00:00  patient01  Heart rate    69
    admissions
    #   patient_id admission_id           startdate             enddate
    # 1  patient01  admission01 2020-04-01 00:04:20 2020-05-01 00:23:59
    
    ### convert to data.table
    setDT(observations)
    setDT(admissions)
    
    ### we need proper 'POSIXt' objects
    observations[, dt_taken := as.POSIXct(dt_taken)]
    admissions[, (dates) := lapply(.SD, as.POSIXct), .SDcols = dates]
    

    And the join.

    admissions[observations, on = .(patient_id, startdate <= dt_taken, enddate >= dt_taken)]
    #    patient_id admission_id  startdate    enddate observation value
    #        <char>       <char>     <POSc>     <POSc>      <char> <int>
    # 1:  patient01  admission01 2020-04-13 2020-04-13  Heart rate    69
    

    Two things that I believe are noteworthy:

    • in SQL (and similarly in other join-friendly languages), it is often shown as

      select ...
      from TABLE1 left join TABLE2 ...
      

      suggesting that TABLE1 is the LHS (left-hand side) and TABLE2 is the RHS table. (This is a gross generalization, mostly gearing towards a left-join since that's all that data.table::[ supports; for inner/outer/full joins, you'll need merge(.) or other external mechanisms. See How to join (merge) data frames (inner, outer, left, right) and https://stackoverflow.com/a/6188334/3358272 for more discussion on JOINs, etc.)

      From this, data.table::['s mechanism is effectively

      TABLE2[TABLE1, on = .(...)]
      RHS[LHS, on = .(...)]
      

      (Meaning that the right-hand-side table is actually the first table from left-to-right ...)

    1. The names in the output of inequi-joins are preserved from the RHS, see that dt_taken is not found. However, the values of those startdate and enddate columns are from dt_taken.

      Because of this, I've often found the simplest way for me to wrap my brain around the renaming and values and such is when I'm not certain, I copy a join column into a new column and join using that column, then delete it post-merge. It's sloppy and lazy, but I've caught myself too many times missing something and thinking it was not what I had thought.

    sqldf

    This might be a little more direct if SQL seems more intuitive.

    sqldf::sqldf(
      "select ob.*, ad.admission_id
       from observations ob
         left join admissions ad on ob.patient_id=ad.patient_id
             and ob.dt_taken between ad.startdate and ad.enddate")
    #     dt_taken patient_id observation value admission_id
    # 1 2020-04-13  patient01  Heart rate    69  admission01
    

    Data (already data.table with POSIXt, works just as well with sqldf though regular data.frames will work just fine, too):

    admissions <- setDT(structure(list(patient_id = "patient01", admission_id = "admission01", startdate = structure(1585713860, class = c("POSIXct", "POSIXt" ), tzone = ""), enddate = structure(1588307039, class = c("POSIXct", "POSIXt"), tzone = "")), class = c("data.table", "data.frame"), row.names = c(NA, -1L)))
    observations <- setDT(structure(list(dt_taken = structure(1586750400, class = c("POSIXct", "POSIXt"), tzone = ""), patient_id = "patient01", observation = "Heart rate", value = 69L), class = c("data.table", "data.frame"), row.names = c(NA, -1L)))
    

    (I use setDT to repair the fact that we can't pass the .internal.selfref attribute here.)