Tags: r, optimization, dataframe, data.table, self-reference

Optimization of calculating how many data.frame record timestamps appear in the one-second window following each record timestamp in that data.frame


I have an R data.frame with a number of columns, one of which contains POSIXct timestamps. I want to add a column to the data.frame that, for each row, contains the number of records whose timestamp falls at or after that row's timestamp and less than one second into the future.

The following code implements this, but it's really slow on the data I'm processing (often 60K+ records). I'd like to know if there is a way to speed this up.

# Create a data frame with POSIXct values spread over a few minutes.
# The actual number of records can be over 60,000.
set.seed(1234)
times <- as.POSIXct("2015-02-18 11:39:17.104206 AEDT") + 
    runif(n = 10000, min = -5*60, max = 5*60)
times <- sort(times) # my source data comes to me sorted
times <- data.frame(datetime = times)

# For each event (timestamp), calculate how many events (timestamps) appear in
# a one second window following that event.
system.time(
for (i in seq_len(nrow(times))) {
    times$eventCount[i] <- sum(
        times$datetime >= times$datetime[i] &
        times$datetime < times$datetime[i] + 1)
}
)

The result on my system is:

user  system elapsed
8.10    0.00    8.21

Interestingly, the processing time does not scale linearly with the number of records: for 20K records, the user time is 24.74 seconds. That makes sense, since each of the n rows is compared against all n timestamps, making the loop O(n²).

Looking at similar questions (like this one and the questions referenced therein) suggests that using a data.table should speed things up considerably, but I can't bridge the gap between the code in those answers (which looks at a fixed number of records on either side of a given record) and what I need (counting an unknown number of records on one side of a given record).

Rcpp looks like it would be the best way to go, but I don't know any C++ at all.

Any help appreciated!


Solution

  • Simpler logic based on @Khashaa's Rcpp solution.

    data

    require(dplyr)
    require(data.table)
    set.seed(1234L)
    dt = data.table(datetime=as.POSIXct("2015-02-18 11:39:17.104206 AEDT") + 
        runif(n = 100000, min = -5*60, max = 5*60), key="datetime")
    df = as.data.frame(dt)
    

    data.table solution

    setNumericRounding(0L)
    # For each value in col, locate the first row with datetime >= value
    # (rolling join, roll = -Inf) and the last row with datetime < value + 1
    # (rolling join, roll = Inf, against an upper bound nudged just below
    # value + 1 so the window stays half-open); their difference is the
    # number of timestamps in [value, value + 1).
    betweendt <- function(x, col, eps) {
        idx1 = x[.(col), mult="first", roll=-Inf, which=TRUE]
        idx2 = x[.(col+1-eps-unclass(col)*eps), 
                    mult="last", roll=Inf, which=TRUE]
        idx2-idx1+1L
    }
    system.time({
        dt[, eventC := betweendt(dt, dt$datetime, .Machine$double.eps)]
    })
    #    user  system elapsed 
    #   0.043   0.001   0.045 
    

    Rcpp version (from @Khashaa)
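
    @Khashaa's betweenCpp() function isn't reproduced in this post. The
    sketch below is a hypothetical reconstruction, assuming the function
    counts, via two binary searches on the sorted vector, how many values
    fall in the half-open window [lo, hi):

    # Hypothetical stand-in for @Khashaa's betweenCpp() (an assumption,
    # not the original code): count sorted values in [lo, hi).
    Rcpp::cppFunction('
        int betweenCpp(NumericVector x, double lo, double hi) {
            // x is assumed sorted ascending; POSIXct arrives here as its
            // underlying numeric value (seconds since the epoch).
            NumericVector::iterator first = std::lower_bound(x.begin(), x.end(), lo);
            NumericVector::iterator last  = std::lower_bound(x.begin(), x.end(), hi);
            return (int)(last - first);
        }', includes = "#include <algorithm>")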

    system.time({
      col = df$datetime
      df <- df %>% 
        rowwise() %>% 
        mutate(eventC = betweenCpp(col, datetime, datetime+1)) 
    })
    #    user  system elapsed 
    #   0.142   0.001   0.142 
    
    identical(df$eventC, dt$eventC)
    # [1] TRUE
    

    data.table solution is ~3x faster here.
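
    As a further point of reference (my addition, not part of the original
    answers), base R's findInterval() can do the same counting on the sorted
    timestamps without any packages; with left.open = TRUE it counts the
    values strictly below each query point, so the difference of two calls
    gives the count in [t, t + 1):

    x  <- as.numeric(dt$datetime)
    hi <- findInterval(x + 1, x, left.open = TRUE)  # values < t + 1
    lo <- findInterval(x,     x, left.open = TRUE)  # values < t
    eventBase <- hi - lo

    Whether eventBase matches eventC exactly can depend on floating-point
    behaviour at the one-second boundary, which the epsilon adjustment in
    betweendt() handles explicitly.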


    Refer to the revision history for an older version using foverlaps() (which was overkill).