I have data of the following form.(also see below)
The first variable V1 describes the type of observation- those labelled H give information for all values below until another H observation. The second variable, for all other observations, says how many observations above is the H that describes the variable.
I want to compute a variable that is the lagged value of the third variable V3, where the number of lags is equal to the value of variable V2. Is there a way to use the lag() function in dplyr where I can feed it a number n of lags that changes with the row of my data set? Any other way to compute a lag that changes with the row? Ideally I would like to use methods from dplyr since I am applying this to large data for which sparklyr is useful.
Edit: output of dput of the first 25 observations of my data:
structure(list(V1 = c("H", "L", "S", "S", "S", "S", "S", "S",
"S", "S", "H", "L", "S", "S", "S", "S", "S", "S", "S", "S", "H",
"L", "S", "S", "S"), V2 = c(723L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 723L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 723L, 1L, 2L,
3L, 4L), V3 = c("O2:CH\\A20\\7250.0", "1574721653.7385", "1574721662.9515",
"1574729100.1103", "1574729880.0359", "1574730000.0388", "1574757000.0533",
"1574757240.0648", "1574757300.0335", "1574757360.039", "O2:CH\\A20\\7300.0",
"1574721653.7385", "1574721662.9515", "1574729100.1103", "1574729880.0359",
"1574730000.0388", "1574757000.0533", "1574757240.0648", "1574757300.0335",
"1574757360.039", "O2:CH\\A20\\7350.0", "1574721653.7385", "1574721662.9515",
"1574729100.1103", "1574729880.0359"), V4 = c("USD", "1", "0",
"2", "3", "5", "9", "3", "12", "13", "USD", "1", "0", "2", "3",
"5", "9", "3", "12", "13", "USD", "1", "0", "2", "3"), V5 = c("6",
"", "", "", "", "", "", "", "", "", "6", "", "", "", "", "",
"", "", "", "", "6", "", "", "", "")), row.names = c(NA, 25L), class = "data.frame")
lag
can only accept a scalar value but here's a little trick that we can use :
library(dplyr)
df %>%
mutate(inds = row_number() - lag,
c = a[replace(inds, inds < 1, NA)])
# a lag inds c
#1 1 3 -2 NA
#2 2 3 -1 NA
#3 3 3 0 NA
#4 4 2 2 2
#5 5 3 2 2
#6 6 2 4 4
#7 7 2 5 5
#8 8 2 6 6
#9 9 3 6 6
#10 10 1 9 9
c
is the final output we are looking at. Basically, we subtract the current row number with the lag value to get index which is used to subset a
values.
data
set.seed(123)
df <- data.frame(a = 1:10, lag = sample(3, 10, replace = TRUE))
df
# a lag
#1 1 3
#2 2 3
#3 3 3
#4 4 2
#5 5 3
#6 6 2
#7 7 2
#8 8 2
#9 9 3
#10 10 1