Search code examples
rdplyrlagsparklyr

dplyr lag, with number of lags varying by observation


I have data of the following form.(also see below)

The first variable V1 describes the type of observation- those labelled H give information for all values below until another H observation. The second variable, for all other observations, says how many observations above is the H that describes the variable.

I want to compute a variable that is the lagged value of the third variable V3, where the number of lags is equal to the value of variable V2. Is there a way to use the lag() function in dplyr where I can feed it a number n of lags that changes with the row of my data set? Any other way to compute a lag that changes with the row? Ideally I would like to use methods from dplyr since I am applying this to large data for which sparklyr is useful.

Edit: output of dput of the first 25 observations of my data:

structure(list(V1 = c("H", "L", "S", "S", "S", "S", "S", "S", 
"S", "S", "H", "L", "S", "S", "S", "S", "S", "S", "S", "S", "H", 
"L", "S", "S", "S"), V2 = c(723L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 723L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 723L, 1L, 2L, 
3L, 4L), V3 = c("O2:CH\\A20\\7250.0", "1574721653.7385", "1574721662.9515", 
"1574729100.1103", "1574729880.0359", "1574730000.0388", "1574757000.0533", 
"1574757240.0648", "1574757300.0335", "1574757360.039", "O2:CH\\A20\\7300.0", 
"1574721653.7385", "1574721662.9515", "1574729100.1103", "1574729880.0359", 
"1574730000.0388", "1574757000.0533", "1574757240.0648", "1574757300.0335", 
"1574757360.039", "O2:CH\\A20\\7350.0", "1574721653.7385", "1574721662.9515", 
"1574729100.1103", "1574729880.0359"), V4 = c("USD", "1", "0", 
"2", "3", "5", "9", "3", "12", "13", "USD", "1", "0", "2", "3", 
"5", "9", "3", "12", "13", "USD", "1", "0", "2", "3"), V5 = c("6", 
"", "", "", "", "", "", "", "", "", "6", "", "", "", "", "", 
"", "", "", "", "6", "", "", "", "")), row.names = c(NA, 25L), class = "data.frame")

Solution

  • lag can only accept a scalar value but here's a little trick that we can use :

    library(dplyr)
    
    df %>%
      mutate(inds = row_number() - lag, 
             c = a[replace(inds, inds < 1, NA)])
    
    #    a lag inds  c
    #1   1   3   -2 NA
    #2   2   3   -1 NA
    #3   3   3    0 NA
    #4   4   2    2  2
    #5   5   3    2  2
    #6   6   2    4  4
    #7   7   2    5  5
    #8   8   2    6  6
    #9   9   3    6  6
    #10 10   1    9  9
    

    c is the final output we are looking at. Basically, we subtract the current row number with the lag value to get index which is used to subset a values.

    data

    set.seed(123)
    df <- data.frame(a = 1:10, lag = sample(3, 10, replace = TRUE))
    df
    #    a lag
    #1   1   3
    #2   2   3
    #3   3   3
    #4   4   2
    #5   5   3
    #6   6   2
    #7   7   2
    #8   8   2
    #9   9   3
    #10 10   1