Search code examples
rdplyrwindow-functions

dplyr-- how to do aggregate window of last or future N items?


I have the following data:

> str(eth)
'data.frame':   11029 obs. of  10 variables:
 $ Date           : Date, format: "2017-07-01" "2017-07-01" "2017-07-01" "2017-07-01" ...
 $ Symbol         : Factor w/ 1 level "ETHUSD": 1 1 1 1 1 1 1 1 1 1 ...
 $ Open           : num  264 257 258 258 260 ...
 $ High           : num  265 264 261 262 261 ...
 $ Low            : num  260 256 254 257 253 ...
 $ Close          : num  263 264 257 258 258 ...
 $ Volume.From    : num  7221 4975 10747 9118 15402 ...
 $ Volume.To      : num  1902503 1290128 2765561 2366698 3962669 ...
 $ future_12h_high: num  NA NA NA NA NA NA NA NA NA NA ...
 $ past_12h_high  : num  NA NA NA NA NA NA NA NA NA NA ...

And:

> head(eth)
Date Symbol   Open   High    Low  Close Volume.From Volume.To future_12h_high past_12h_high
1 2017-07-01 ETHUSD 263.84 264.97 260.31 263.12     7221.08   1902503              NA            NA
2 2017-07-01 ETHUSD 257.13 264.36 256.03 263.84     4975.12   1290128              NA            NA
3 2017-07-01 ETHUSD 258.17 260.56 254.15 257.13    10746.60   2765561              NA            NA
4 2017-07-01 ETHUSD 258.49 262.00 257.12 258.17     9118.43   2366698              NA            NA
5 2017-07-01 ETHUSD 259.50 260.88 253.23 258.49    15402.48   3962669              NA            NA
6 2017-07-01 ETHUSD 263.51 266.73 255.27 259.50    20821.39   5396852              NA            NA

And:

> dput(head(eth,10))
structure(list(Date = structure(c(17348, 17348, 17348, 17348, 
17348, 17348, 17348, 17348, 17348, 17348), class = "Date"), Symbol = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "ETHUSD", class = "factor"), 
    Open = c(263.84, 257.13, 258.17, 258.49, 259.5, 263.51, 268, 
    272.57, 265.74, 268.79), High = c(264.97, 264.36, 260.56, 
    262, 260.88, 266.73, 268.44, 272.57, 272.74, 269.9), Low = c(260.31, 
    256.03, 254.15, 257.12, 253.23, 255.27, 262.39, 267.6, 265, 
    265), Close = c(263.12, 263.84, 257.13, 258.17, 258.49, 259.5, 
    263.51, 268, 272.57, 265.74), Volume.From = c(7221.08, 4975.12, 
    10746.6, 9118.43, 15402.48, 20821.39, 7142.36, 4776.58, 5581.66, 
    6367.05), Volume.To = c(1902503.11, 1290127.76, 2765560.88, 
    2366698.5, 3962669, 5396852.35, 1894983.33, 1287300.75, 1500282.55, 
    1702536.85), future_12h_high = c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), past_12h_high = c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_)), row.names = c(NA, 10L), class = "data.frame")

And i'm trying to calculate the past and future highs within the last 12 data points.

eth <- read.csv("/Users/micahsmith/Downloads/Gdax_ETHUSD_1h.csv") %>%
  mutate(Date = as.Date(Date, "%Y-%m-%d %I-%p")) %>%
  arrange(Date) %>%
  mutate(future_12h_high = max(lead(High,12)), past_12h_high = max(lag(High,12)))

The code above is incorrect-- it is computing the maxes of the current item and the [item - 12] indexed item. I would like to use all of the last 12 and all of the future 12 items in that.

Summed Up

How do i window with a range of future and past items, not just one value?


Solution

  • I've lowered the window from 12 to 4 since your example data didn't have 12 observations, but here's an approach using RcppRoll:

    # install.packages("RcppRoll")
    library(RcppRoll)
    library(dplyr)
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    
    eth %>% mutate(Date = as.Date(Date, "%Y-%m-%d %I-%p")) %>%
        arrange(Date) %>%
        mutate(future_12h_high = roll_maxl(High, 4),
               past_12h_high = roll_maxr(High, 4))
    #>          Date Symbol   Open   High    Low  Close Volume.From Volume.To
    #> 1  2017-07-01 ETHUSD 263.84 264.97 260.31 263.12     7221.08   1902503
    #> 2  2017-07-01 ETHUSD 257.13 264.36 256.03 263.84     4975.12   1290128
    #> 3  2017-07-01 ETHUSD 258.17 260.56 254.15 257.13    10746.60   2765561
    #> 4  2017-07-01 ETHUSD 258.49 262.00 257.12 258.17     9118.43   2366698
    #> 5  2017-07-01 ETHUSD 259.50 260.88 253.23 258.49    15402.48   3962669
    #> 6  2017-07-01 ETHUSD 263.51 266.73 255.27 259.50    20821.39   5396852
    #> 7  2017-07-01 ETHUSD 268.00 268.44 262.39 263.51     7142.36   1894983
    #> 8  2017-07-01 ETHUSD 272.57 272.57 267.60 268.00     4776.58   1287301
    #> 9  2017-07-01 ETHUSD 265.74 272.74 265.00 272.57     5581.66   1500283
    #> 10 2017-07-01 ETHUSD 268.79 269.90 265.00 265.74     6367.05   1702537
    #>    future_12h_high past_12h_high
    #> 1           264.97            NA
    #> 2           264.36            NA
    #> 3           266.73            NA
    #> 4           268.44        264.97
    #> 5           272.57        264.36
    #> 6           272.74        266.73
    #> 7           272.74        268.44
    #> 8               NA        272.57
    #> 9               NA        272.74
    #> 10              NA        272.74
    

    As stated in the dplyr Window functions vignette, "Rolling aggregates operate in a fixed width window. You won’t find them in base R or in dplyr, but there are many implementations in other packages, such as RcppRoll."

    roll_maxr(High, 4) gets the max of High in a rolling window of length four with right alignment (so the three observations before the current observation and the current observation). roll_maxl(High, 4) does the same thing, but with left alignment (so the three observations after the current observation and the current observation).