Search code examples
rjointime-seriesdata.table

Fast data.table joins with time series


I have two data sets with currency data prices (High, Low, Open and Close) on two different timeframes (1 hour and 5 minute).

I have price targets for each row of the 1 Hour timeframe data, and a target direction for the trade (either 1 or -1).

For trades with a direction of '1', I'm trying to find the first point on the 5 minute chart where the "low" is <= the target level. Conversely, for trades with a direction of '-1', I want to find the first point on the 5-minute chart where the "high" is >= the target trade level.

I've put the code below to demonstrate what I'm after.

The problem I'm having in practice is that I'm looking at time periods spanning 10 - 20 years, which makes the joins very slow. I've set the example below to be from 2016 to 2022 so it's not too slow on my computer, but if you extend it out to more than 10 years it starts to really become a pain.

This may be something that I have to live with, but I'm looking for guidance on two things:

  1. Is there a faster / more efficient way of achieving what I've outlined below?
  2. You can see that I've separated the buy and sell trades into two joins at the end. Is there a way to combine that into one single join? It's not really that important as it doesn't take up too much 'space' in my code, but I'm interested for educational purposes.

Thanks heaps in advance, Phil

# Load required packages
library(data.table)
library(dplyr)

# Define timeframes for the data
Start <- as.POSIXct("2016-01-01 00:00:00")
End <- as.POSIXct("2022-01-01 23:55:00")

Hours <- floor(as.numeric(difftime(End,Start,units = "hours"))) + 1
Minutes <- floor(as.numeric(difftime(End,Start,units = "mins")) / 5) + 1

# Create the Hourly data
set.seed(123)
hourly_prices <- data.table(
  datetime = seq(Start, End, by = "hour"),
  open = rnorm(Hours, mean = 100, sd = 1),
  high = rnorm(Hours, mean = 101, sd = 1),
  low = rnorm(Hours, mean = 99, sd = 1),
  close = rnorm(Hours, mean = 100, sd = 1),
  Direction = sample(c(1,-1),Hours,replace = T)) %>%
  .[,Target_price := ifelse(Direction == -1,rnorm(.N, mean = 104, sd = 1),rnorm(.N,mean = 97,sd = 1))]

# Create the 5-minute data
set.seed(456)
minute_prices <- data.table(
  datetime = seq(Start, End, by = "5 min"),
  open = rnorm(Minutes, mean = 100, sd = 1),
  high = rnorm(Minutes, mean = 101, sd = 1),
  low = rnorm(Minutes, mean = 99, sd = 1),
  close = rnorm(Minutes, mean = 100, sd = 1),
  Position = seq_len(Minutes))

# Join the two data.tables to find the first point at which price passes the target levels
hourly_prices[(Direction == 1),Location := minute_prices[.SD, on = .(datetime > datetime, low <= Target_price),mult = "first",x.Position]]
hourly_prices[(Direction == -1),Location := minute_prices[.SD, on = .(datetime > datetime, high >= Target_price),mult = "first",x.Position]]

Solution

  • One simple ~3x speed-up would be to subset the minute_prices data.table to include only those rows that are the cumulative min/max for the current hour (minus 1 second, since the join is datetime > datetime):

    dtM <- copy(minute_prices)
    dtH <- copy(hourly_prices)
    
    system.time({
      dtH[(Direction == 1),Location := dtM[dtM[, low == cummin(low), as.integer(datetime - 1)%/%3600L][[2]]][.SD, on = .(datetime > datetime, low <= Target_price),mult = "first",x.Position]]
      dtH[(Direction == -1),Location := dtM[dtM[, high == cummax(high), as.integer(datetime - 1)%/%3600L][[2]]][.SD, on = .(datetime > datetime, high >= Target_price),mult = "first",x.Position]]
    })
    #>    user  system elapsed 
    #>    6.77    0.01    6.81
    
    system.time({
      hourly_prices[(Direction == 1),Location := minute_prices[.SD, on = .(datetime > datetime, low <= Target_price),mult = "first",x.Position]]
      hourly_prices[(Direction == -1),Location := minute_prices[.SD, on = .(datetime > datetime, high >= Target_price),mult = "first",x.Position]]
    })
    #>    user  system elapsed 
    #>   22.97    0.00   23.04
    
    identical(dtH, hourly_prices)
    #> [1] TRUE