Tags: r, dplyr, lag, lead

Lagging variables by day and creating new rows in the process


I'm trying to lag variables by day, but many cases don't have an observation on the adjacent day, so I need to add an extra row for the missing day in the process. dplyr gets me close, but I still need a way to add those rows, and I have many thousands of cases. Any thoughts would be much appreciated.

ID <- c(1, 1, 1, 1, 2, 2)
day <- c(0, 1, 2, 5, 1, 3)
v <- c(2.2, 3.4, 1.2, 0.8, 6.4, 2)
dat1 <- data.frame(ID, day, v)
dat1

  ID day   v
1  1   0 2.2
2  1   1 3.4
3  1   2 1.2
4  1   5 0.8
5  2   1 6.4
6  2   3 2.0

Using dplyr gets me here:

library(dplyr)

dat2 <-
  dat1 %>%
  group_by(ID) %>%
  mutate(v.L = dplyr::lead(v, n = 1, default = NA))
dat2
     ID   day     v   v.L
1     1     0   2.2   3.4
2     1     1   3.4   1.2
3     1     2   1.2   0.8
4     1     5   0.8    NA
5     2     1   6.4   2.0
6     2     3   2.0    NA

But I need to get here:

ID2 <- c(1, 1, 1, 1, 1, 2, 2, 2)
day2 <- c(0, 1, 2, 4, 5, 1, 2, 3)
v2 <- c(2.2, 3.4, 1.2, NA, 0.8, 6.4, NA, 2)
v2.L <- c(3.4, 1.2, NA, 0.8, NA, NA, 2, NA)
dat3 <- data.frame(ID2, day2, v2, v2.L)
dat3

   ID2 day2  v2 v2.L
1   1    0 2.2  3.4
2   1    1 3.4  1.2
3   1    2 1.2   NA
4   1    4  NA  0.8
5   1    5 0.8   NA
6   2    1 6.4   NA
7   2    2  NA  2.0
8   2    3 2.0   NA

Solution

  • You could use complete and full_seq from the tidyr package to fill in the missing days for each ID, which adds rows with v = NA. Once the days are consecutive, lead picks up the next day's value. At the end, remove the filler rows that have NA in both v and v.L:

    library(dplyr)
    library(tidyr)
    
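    # Fill in the missing days per ID, take the next day's value,
    # then drop filler rows where both v and v.L are NA
    dat2 = dat1 %>%
           group_by(ID) %>%
           complete(day = full_seq(day,1)) %>%
           mutate(v.L = lead(v)) %>%
           filter(!(is.na(v) & is.na(v.L)))
    dat2
    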
        ID   day     v   v.L
      <dbl> <dbl> <dbl> <dbl>
         1     0   2.2   3.4
         1     1   3.4   1.2
         1     2   1.2    NA
         1     4    NA   0.8
         1     5   0.8    NA
         2     1   6.4    NA
         2     2    NA   2.0
         2     3   2.0    NA
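
To see why this works, it may help to run the pipeline one step at a time. The sketch below rebuilds the example data from the question and splits the answer's pipeline into three stages. The intermediate names `filled`, `with_lead`, and `result` are made up for illustration and are not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Example data from the question
dat1 <- data.frame(
  ID  = c(1, 1, 1, 1, 2, 2),
  day = c(0, 1, 2, 5, 1, 3),
  v   = c(2.2, 3.4, 1.2, 0.8, 6.4, 2)
)

# full_seq() fills the gaps between the smallest and largest value
full_seq(c(0, 1, 2, 5), period = 1)  # 0 1 2 3 4 5

# Step 1: one row per ID and day; days without an observation get v = NA
filled <- dat1 %>%
  group_by(ID) %>%
  complete(day = full_seq(day, period = 1)) %>%
  ungroup()

# Step 2: with no gaps left, the next row within an ID is always the next day
with_lead <- filled %>%
  group_by(ID) %>%
  mutate(v.L = lead(v)) %>%
  ungroup()

# Step 3: drop filler rows that carry neither a value nor a next-day value
result <- with_lead %>%
  filter(!(is.na(v) & is.na(v.L)))

result
```

Note that step 1 temporarily creates a row for every day between each ID's first and last observation, so with long gaps the intermediate table can be much larger than the final one.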