I have an R data.frame with a number of columns, one of which contains POSIXct timestamp records. I want to add a column to the data.frame that, for each row, contains the number of records whose timestamp falls within one second after that row's timestamp, i.e. in the half-open window [t, t + 1).
The following code implements this, but it's really slow on the data I'm processing (often 60K+ records). I'd like to know if there is a way to speed this up.
# Create a data frame with POSIXct values spread over a few minutes.
# The actual number of records can be over 60,000.
set.seed(1234)
times <- as.POSIXct("2015-02-18 11:39:17.104206 AEDT") +
  runif(n = 10000, min = -5*60, max = 5*60)
times <- sort(times)  # my source data comes to me sorted
times <- data.frame(datetime = times)
# For each event (timestamp), calculate how many events (timestamps) appear in
# a one second window following that event.
system.time(
  for (i in seq_along(times$datetime)) {
    times$eventCount[i] <- sum(
      times$datetime >= times$datetime[i] &
        times$datetime < times$datetime[i] + 1)
  }
)
The result on my system is:
   user  system elapsed
   8.10    0.00    8.21
Interestingly, the processing time does not scale linearly with the number of records: for 20K records, the user time is 24.74 seconds. That is unsurprising in hindsight, since each of the n loop iterations compares against all n timestamps, so the work grows as O(n²).
Similar questions (like this one and the questions referenced therein) suggest that using a data.table should speed things up considerably, but I can't bridge the gap between the code in those answers (which looks at a fixed number of records on either side of a given record) and what I need (an unknown number of records on one side of a given record).
Rcpp looks like it would be the best way to go, but I don't know any C++ at all.
Any help appreciated!
Simpler logic based on @Kashaa's Rcpp solution, using two rolling joins on the keyed datetime column: for each timestamp t, the first join finds the index of the first row with datetime >= t, the second finds the index of the last row with datetime just under t + 1, and the difference (plus one) is the event count.
require(dplyr)
require(data.table)

set.seed(1234L)
dt = data.table(datetime = as.POSIXct("2015-02-18 11:39:17.104206 AEDT") +
                  runif(n = 100000, min = -5*60, max = 5*60),
                key = "datetime")
df = as.data.frame(dt)

setNumericRounding(0L)  # join on the full double precision of the timestamps

betweendt <- function(x, col, eps) {
  # Index of the first row with datetime >= each value
  # (roll backwards to the next observation when there is no exact match).
  idx1 = x[.(col), mult = "first", roll = -Inf, which = TRUE]
  # Index of the last row with datetime < each value + 1 second; back the
  # upper bound off by eps (scaled to the magnitude of the epoch seconds)
  # so the forward roll excludes timestamps landing exactly on value + 1.
  idx2 = x[.(col + 1 - eps - unclass(col) * eps),
           mult = "last", roll = Inf, which = TRUE]
  idx2 - idx1 + 1L
}
system.time({
  dt[, eventC := betweendt(dt, dt$datetime, .Machine$double.eps)]
})
#   user  system elapsed
#  0.043   0.001   0.045
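To make the two rolling joins concrete, here is a toy sketch (my illustration, not part of the original timing comparison) on five hand-picked timestamps; the setNumericRounding(0L) call from above still applies:
# Toy data: five timestamps at offsets 0, 0.2, 0.9, 1.5 and 3.0 seconds.
toy <- data.table(datetime = as.POSIXct("2015-02-18 11:39:17", tz = "UTC") +
                    c(0, 0.2, 0.9, 1.5, 3.0),
                  key = "datetime")
eps <- .Machine$double.eps
# First row with datetime >= each value: exact self-matches here.
toy[.(toy$datetime), mult = "first", roll = -Inf, which = TRUE]
# [1] 1 2 3 4 5
# Last row with datetime just under each value + 1 second.
toy[.(toy$datetime + 1 - eps - unclass(toy$datetime) * eps),
    mult = "last", roll = Inf, which = TRUE]
# [1] 3 3 4 4 5
# Differences plus one give the window counts: 3 2 2 1 1.
The dplyr + Rcpp timing below is for comparison.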
system.time({
  col = df$datetime
  df <- df %>%
    rowwise() %>%
    # betweenCpp() is @Kashaa's Rcpp function from the answer referenced above.
    mutate(eventC = betweenCpp(col, datetime, datetime + 1))
})
#   user  system elapsed
#  0.142   0.001   0.142
identical(df$eventC, dt$eventC)
# [1] TRUE
The data.table solution is ~3x faster here.
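As a further, dependency-free cross-check (my addition; it assumes the timestamps are sorted, which they are since the table is keyed), base R's findInterval() can produce the same counts:
# Count elements in [t, t + 1) as (# elements < t + 1) - (# elements < t).
# With left.open = TRUE, findInterval(x, vec) returns, for each x, the number
# of elements of vec that are strictly less than x.
tnum <- as.numeric(df$datetime)
eventC_base <- as.integer(
  findInterval(tnum + 1, tnum, left.open = TRUE) -
    findInterval(tnum, tnum, left.open = TRUE)
)
identical(eventC_base, dt$eventC)
# should be TRUE, barring timestamps sitting within ~1e-7 s of a window edge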
Refer to the edit history for an older version that used foverlaps() (which was overkill here).
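The same two-join trick generalizes to any window width; a sketch (my extrapolation of the function above, with betweendt_w a hypothetical name):
# Count events in the half-open window [t, t + w) for an arbitrary width w
# (in seconds), using the same pair of rolling joins as betweendt().
betweendt_w <- function(x, col, w, eps = .Machine$double.eps) {
  idx1 <- x[.(col), mult = "first", roll = -Inf, which = TRUE]
  idx2 <- x[.(col + w - eps - unclass(col) * eps),
            mult = "last", roll = Inf, which = TRUE]
  idx2 - idx1 + 1L
}
dt[, eventC5 := betweendt_w(dt, datetime, w = 5)]  # e.g. a 5-second window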