Search code examples
rdata.tablenon-equi-join

Add column to data.table specifying if time is between time points in another table


I have a table (dt1) of fairly regular time points, like this:

      DATE        TIME    READING
2022-02-02    11:50:23     123.34
2022-02-02    11:50:59     125.66
2022-02-02    11:51:16     159.23
2022-02-02    11:52:34     234.22

etc... Note that the interval between time points can be very irregular.

I have another table (dt2) specifying the start and end times of treatments:

      DATE    RED_START     RED_END    GREEN_START    GREEN_END
2022-02-02     11:50:30    11:51:12       11:52:12     11:53:17
2022-02-02     11:54:10    11:55:09       11:56:30     11:57:15

It specifies when the treatment was "RED" and when it was "GREEN". Note that there are certain time periods that are neither "RED" or "GREEN".

I'm trying to add a column to the first table to say what the treatment was at the time point given for each row. The desired output would look like this:

      DATE        TIME    READING   TREATMENT
2022-02-02    11:50:23     123.34        NULL
2022-02-02    11:50:59     125.66         RED
2022-02-02    11:51:16     159.23        NULL
2022-02-02    11:52:34     234.22       GREEN

I've tried using a non-equi join between the two tables to generate helper columns. e.g.

dt1[, RED := c("TRUE","FALSE")[
    dt2[.SD, on=.(DATE, RED_START<=TIME, RED_END>=TIME),
               by=.EACHI]
]]

But I've not got the hang of the syntax and it's not working. I'm not sure if I'm stating the multiple conditions incorrectly, or if the non-equi join requires additional info. I would prefer answers using data.table if possible, to keep it consistent with the rest of my code.


Solution

  • This is a case for data.table::foverlaps().

    However, your data in an odd format, particularly dt2.

    Desired format

    You need dt1 to have a combined datetime column, and a dummy end time (to create a window to overlap with dt2).

    #          DATE     TIME READING            datetime               dummy
    #        <IDat>  <ITime>   <num>              <POSc>              <POSc>
    # 1: 2022-02-02 11:50:23  123.34 2022-02-02 11:50:23 2022-02-02 11:50:23
    # 2: 2022-02-02 11:50:59  125.66 2022-02-02 11:50:59 2022-02-02 11:50:59
    # 3: 2022-02-02 11:51:16  159.23 2022-02-02 11:51:16 2022-02-02 11:51:16
    # 4: 2022-02-02 11:52:34  234.22 2022-02-02 11:52:34 2022-02-02 11:52:34
    

    dt2 is in a bizarre half-long, half-wide format. It should look like this:

    #     color               start                 end
    #    <char>              <POSc>              <POSc>
    # 1:    red 2022-02-02 11:50:30 2022-02-02 11:51:12
    # 2:  green 2022-02-02 11:52:12 2022-02-02 11:53:17
    # 3:    red 2022-02-02 11:54:10 2022-02-02 11:55:09
    # 4:  green 2022-02-02 11:56:30 2022-02-02 11:57:15
    

    Getting the data into shape

    I love data.table, but this is one of those data wrangling cases where the tidyverse would be easier. This works but it's not beautiful:

    as_datetime <- function(x, y) {
        as.POSIXct(paste(x, y))
    }
    
    dt1[, datetime := as_datetime(DATE, TIME)][
        ,
        dummy := datetime
    ]
    
    time_cols <- grep("START|END", names(dt2), value = TRUE)
    dt2 <- dt2[, (tolower(time_cols)) := lapply(
        .SD,
        \(y) as_datetime(DATE, y)
    ), .SDcols = time_cols] |>
        melt(
            id.vars = "DATE",
            measure.vars = patterns("[a-z]_[a-z]")
        )
    dt2[
        ,
        c("color", "period") := tstrsplit(variable, "_")
    ]
    
    dt2 <- dcast(dt2, color ~ period, fun.aggregate = list)[, .(
        color = rep(color, times = lengths(end)),
        start = as.POSIXct(unlist(start), origin = "1970-01-01"),
        end = as.POSIXct(unlist(end), origin = "1970-01-01")
    )]
    

    The join

    Once the data is in the right format, it's simply a case of:

    setkey(dt2, start, end)
    foverlaps(dt1, dt2, by.x = c("datetime", "dummy"))[
        ,
        .(datetime, READING, start, end, color)
    ]
    
    #               datetime READING               start                 end  color
    #                 <POSc>   <num>              <POSc>              <POSc> <char>
    # 1: 2022-02-02 11:50:23  123.34                <NA>                <NA>   <NA>
    # 2: 2022-02-02 11:50:59  125.66 2022-02-02 11:50:30 2022-02-02 11:51:12    red
    # 3: 2022-02-02 11:51:16  159.23                <NA>                <NA>   <NA>
    # 4: 2022-02-02 11:52:34  234.22 2022-02-02 11:52:12 2022-02-02 11:53:17  green