Search code examples
rtimestampdata.tableleft-jointidyr

Left join large time series dataset when within timestamp range in R


I need to join two datasets together:

  1. time series data taken every second with a heart rate measure (with 1.4 million records)
>allsecsHR
timestamp               HRbpm
2023-03-22 09:04:53     101
2023-03-22 09:04:54     124
2023-03-22 09:04:55     103
2023-03-22 09:04:56     111
2023-03-22 09:04:57     112
2023-03-22 09:04:58     143
2023-03-22 09:04:59     109
2023-03-22 09:05:00     129
2023-03-22 09:05:01     122
2023-03-22 09:05:02     125
2023-03-22 09:05:03     110
  1. behavioural records with a start and end timestamp
>bhr
Behaviour  BhrTimeStart         BhrTimeEnd
Forage     2023-03-22 09:04:53  2023-03-22 09:04:58
Vigilance  2023-03-22 09:04:58  2023-03-22 09:05:03
Forage     2023-03-22 09:05:03  2023-03-22 09:05:10

At the end I would like a dataset that has each row representing one second, with the behaviour that was being performed per second

I have tried to do this in mySQL and in R using sqldf and powerjoin but it keeps losing connection to the sql server after running for 2 hours or in R it does not complete and crashes or returns Error: vector memory exhausted (limit reached?). I would really appreciate some help to find an efficient way to do this!

I think a data.table solution might be the fastest but I am not sure how to do it, a tidy solution would also be great!

So far I have tried:

mySQL/sqldf

library(sqldf)
sqldf("select * from allsecsHR
              left join bhr
              on allsecsHR.timestamp between bhr.BhrTimeStart and bhr.BhrTimeEnd")

powerjoin

library(powerjoin)
power_left_join(
  allsecsHR, bhr, 
  by = ~.x$timestamp > .y$BhrTimeStart & 
    (.x$timestamp < .y$BhrTimeEnd | is.na(.y$BhrTimeEnd)),
  keep = "left")

The output table I would like is this:

timestamp           HRbpm   Bhr         BhrTimeStart        BhrTimeEnd
2023-03-22 09:04:53 101     Forage      2023-03-22 09:04:53   2023-03-22 09:04:58
2023-03-22 09:04:54 124     Forage      2023-03-22 09:04:53   2023-03-22 09:04:58
2023-03-22 09:04:55 103     Forage      2023-03-22 09:04:53   2023-03-22 09:04:58
2023-03-22 09:04:56 111     Forage      2023-03-22 09:04:53   2023-03-22 09:04:58
2023-03-22 09:04:57 112     Forage      2023-03-22 09:04:53   2023-03-22 09:04:58
2023-03-22 09:04:58 143     Vigilance   2023-03-22 09:04:58   2023-03-22 09:05:03
2023-03-22 09:04:59 109     Vigilance   2023-03-22 09:04:58   2023-03-22 09:05:03
2023-03-22 09:05:00 129     Vigilance   2023-03-22 09:04:58   2023-03-22 09:05:03
2023-03-22 09:05:01 122     Vigilance   2023-03-22 09:04:58   2023-03-22 09:05:03
2023-03-22 09:05:02 125     Vigilance   2023-03-22 09:04:58   2023-03-22 09:05:03
2023-03-22 09:05:03 110     Forage      2023-03-22 09:05:03   2023-03-22 09:05:10

Solution

  • If all timestamps in allsecsHR have a corresponding interval in bhr:

    library(data.table)
    
    setDT(bhr)[setDT(allsecsHR)[, ts := timestamp], on = .(BhrTimeStart == ts), roll = TRUE]
    #>           Bhr        BhrTimeStart          BhrTimeEnd           timestamp HRbpm
    #>  1:    Forage 2023-03-22 09:04:53 2023-03-22 09:04:58 2023-03-22 09:04:53   101
    #>  2:    Forage 2023-03-22 09:04:54 2023-03-22 09:04:58 2023-03-22 09:04:54   124
    #>  3:    Forage 2023-03-22 09:04:55 2023-03-22 09:04:58 2023-03-22 09:04:55   103
    #>  4:    Forage 2023-03-22 09:04:56 2023-03-22 09:04:58 2023-03-22 09:04:56   111
    #>  5:    Forage 2023-03-22 09:04:57 2023-03-22 09:04:58 2023-03-22 09:04:57   112
    #>  6: Vigilance 2023-03-22 09:04:58 2023-03-22 09:05:03 2023-03-22 09:04:58   143
    #>  7: Vigilance 2023-03-22 09:04:59 2023-03-22 09:05:03 2023-03-22 09:04:59   109
    #>  8: Vigilance 2023-03-22 09:05:00 2023-03-22 09:05:03 2023-03-22 09:05:00   129
    #>  9: Vigilance 2023-03-22 09:05:01 2023-03-22 09:05:03 2023-03-22 09:05:01   122
    #> 10: Vigilance 2023-03-22 09:05:02 2023-03-22 09:05:03 2023-03-22 09:05:02   125
    #> 11:    Forage 2023-03-22 09:05:03 2023-03-22 09:05:10 2023-03-22 09:05:03   110
    

    If not all timestamps in allsecsHR have a corresponding interval in bhr, a non-equi join will work:

    setDT(bhr)[setDT(allsecsHR)[, ts := timestamp], on = .(BhrTimeStart <= ts, BhrTimeEnd > ts)]
    

    Data:

    allsecsHR <- structure(list(timestamp = structure(c(1679490293, 1679490294, 
    1679490295, 1679490296, 1679490297, 1679490298, 1679490299, 1679490300, 
    1679490301, 1679490302, 1679490303), class = c("POSIXct", "POSIXt"
    ), tzone = ""), HRbpm = c(101L, 124L, 103L, 111L, 112L, 143L, 
    109L, 129L, 122L, 125L, 110L)), row.names = c(NA, -11L), class = c("data.table", "data.frame"))
    
    bhr <- structure(list(Bhr = c("Forage", "Vigilance", "Forage"), BhrTimeStart = structure(c(1679490293, 
    1679490298, 1679490303), class = c("POSIXct", "POSIXt"), tzone = ""), 
        BhrTimeEnd = structure(c(1679490298, 1679490303, 1679490310
        ), class = c("POSIXct", "POSIXt"), tzone = "")), row.names = c(NA, 
    -3L), class = c("data.table", "data.frame"))