Search code examples
rreplacelag

lagging, averaging, and replacing variables


I have the first three columns of data. The yest column represents that id did something on the previous day. I'm trying to go from dat to dat2 by adding a new variable "new" that does three things:

  1. copies the value of yest to the previous day. The days are not always consecutive though. so it should only copy if it is the actual previous (day 2 vs day 3) and not just from the next row to the previous row.

  2. the value of yest should be copied to all rows of new with the same id / day combo

  3. if there is more than one value of yest per id / day combo, they should be averaged before populating the new variable.

I've been trying different ifelse and merge combos that are failing miserably. Any help with this would be much appreciated.

    id<-c(1,1,1,1,1,3,3,3,3,3,3,3,8,8,8,8,8)
    day<-c(1,2,2,3,5,0,1,2,3,4,5,5,0,3,4,4,5)
    yest<-c(NA,1,3,NA,NA,1,2,NA,NA,NA,3,NA,NA,NA,NA,3,4)
    dat<-cbind(id,day,yest)
    dat
          id day yest
     [1,]  1   1   NA
     [2,]  1   2    1
     [3,]  1   2    3
     [4,]  1   3   NA
     [5,]  1   5   NA
     [6,]  3   0    1
     [7,]  3   1    2
     [8,]  3   2   NA
     [9,]  3   3   NA
    [10,]  3   4   NA
    [11,]  3   5    3
    [12,]  3   5   NA
    [13,]  8   0   NA
    [14,]  8   3   NA
    [15,]  8   4   NA
    [16,]  8   4    3
    [17,]  8   5    4

    new<-c(2,NA,NA,NA,NA,2,NA,NA,NA,3,NA,NA,NA,3,4,4,NA)
    dat2<-cbind(dat,new)
    dat2

             id day yest new
        [1,]  1   1   NA   2
        [2,]  1   2    1  NA
        [3,]  1   2    3  NA
        [4,]  1   3   NA  NA
        [5,]  1   5   NA  NA
        [6,]  3   0    1   2
        [7,]  3   1    2  NA
        [8,]  3   2   NA  NA
        [9,]  3   3   NA  NA
       [10,]  3   4   NA   3
       [11,]  3   5    3  NA
       [12,]  3   5   NA  NA
       [13,]  8   0   NA  NA
       [14,]  8   3   NA   3
       [15,]  8   4   NA   4
       [16,]  8   4    3   4
       [17,]  8   5    4  NA

Solution

  • library(dplyr)
    
    df <- data.frame(
      id = c(1,1,1,1,1,3,3,3,3,3,3,3,8,8,8,8,8),
      day = c(1,2,2,3,5,0,1,2,3,4,5,5,0,3,4,4,5),
      yest = c(NA,1,3,NA,NA,1,2,NA,NA,NA,3,NA,NA,NA,NA,3,4)
      ) 
    

    First you create a group for each combination of day and id. Then you take the mean of yest, making sure to throw out NAs. Then you subtract 1 from day so you can match it up to the correct day in the original data.

    df_lag <- df %>% 
      group_by(id, day) %>% 
      summarise(new = mean(yest, na.rm = T)) %>% 
      ungroup() %>% 
      mutate(day = day-1)
    
    df_lag
    
    # A tibble: 14 x 3
          id   day   new
       <dbl> <dbl> <dbl>
     1     1     0   NaN
     2     1     1     2
     3     1     2   NaN
     4     1     4   NaN
     5     3    -1     1
     6     3     0     2
     7     3     1   NaN
     8     3     2   NaN
     9     3     3   NaN
    10     3     4     3
    11     8    -1   NaN
    12     8     2   NaN
    13     8     3     3
    14     8     4     4
    

    Here you join them by day and id, using a left_join to throw out the days where there isn't a day in the original dataset (i.e. -1 in lines 5 and 11).

    left_join(df, df_lag)
    
       id day yest new
    1   1   1   NA   2
    2   1   2    1 NaN
    3   1   2    3 NaN
    4   1   3   NA  NA
    5   1   5   NA  NA
    6   3   0    1   2
    7   3   1    2 NaN
    8   3   2   NA NaN
    9   3   3   NA NaN
    10  3   4   NA   3
    11  3   5    3  NA
    12  3   5   NA  NA
    13  8   0   NA  NA
    14  8   3   NA   3
    15  8   4   NA   4
    16  8   4    3   4
    17  8   5    4  NA