Search code examples
rdplyrgroup-bylaglead

r lag and lead indicator by group


I have a dataset with outcome variable - owl weight[y] and time in years when the outcome was recorded[t]. There is a third variable what type of feed was provided to the baby owls[w] and 4th column what year the new feed was introduced[a].

  y       t      w       a
 -1.354   1994   Zi      2015
  2.146   1994   Zi      2015
  3.146   2001   Zi      2015
  2.646   2011   Zi      2015
 -2.354   2015   Zi      2015
 -0.854   2016   Zi      2015
 -3.854   2017   Zi      2015
 -2.354   2021   Zi      2015 
 -2.354   2001   Ca      2010 
  1.146   2002   Ca      2010 
 -2.354   2003   Ca      2010
 -0.854   2005   Ca      2010
 -6.354   2011   Ca      2010
  2.146   2019   Ca      2010

Step1 : I like to create a time to event variable column from the index a column, like this, t-a=time to event variable

  y       t      w       a     TimeToEvent
 -1.354   1994   Zi      2015  -21
  2.146   1994   Zi      2015  -21
  3.146   2001   Zi      2015  -14
  2.646   2011   Zi      2015  -4
 -2.354   2015   Zi      2015   0
 -0.854   2016   Zi      2015   1
 -3.854   2017   Zi      2015   2
 -2.354   2021   Zi      2015   6
 -2.354   2001   Ca      2010  -9
  1.146   2002   Ca      2010  -8
 -2.354   2003   Ca      2010  -7
 -0.854   2005   Ca      2010  -5
  1.147   2010   Ca      2010   0
 -6.354   2011   Ca      2010   1
  2.146   2019   Ca      2010   9

Step2 : I like to create 4 lag and 4 lead variables like this

  y       t      w       a     TimeToEvent  Lg4  Lg3  Lg2  Lg1  Index  Ld1  Ld2  Ld3  Ld4
 -1.354   1994   Zi      2015  -21          1    0    0    0    0      0    0    0    0  
  2.146   1998   Zi      2015  -17          1    0    0    0    0      0    0    0    0  
 -3.146   2001   Zi      2015  -14          1    0    0    0    0      0    0    0    0 
  2.646   2011   Zi      2015  -4           1    0    0    0    0      0    0    0    0 
  0.133   2013   Zi      2015  -2           0    0    1    0    0      0    0    0    0 
 -2.354   2015   Zi      2015   0           0    0    0    0    1      0    0    0    0 
 -0.854   2016   Zi      2015   1           0    0    0    0    0      1    0    0    0
 -3.854   2017   Zi      2015   2           0    0    0    0    0      0    1    0    0
 -2.354   2021   Zi      2015   6           0    0    0    0    0      0    0    0    1
 -2.354   2001   Ca      2010  -9           1    0    0    0    0      0    0    0    0 
  1.146   2002   Ca      2010  -8           1    0    0    0    0      0    0    0    0 
 -2.354   2003   Ca      2010  -7           1    0    0    0    0      0    0    0    0 
 -0.854   2005   Ca      2010  -5           1    0    0    0    0      0    0    0    0 
  1.147   2010   Ca      2010   0           0    0    0    0    1      0    0    0    0 
 -6.354   2011   Ca      2010   1           0    0    0    0    0      1    0    0    0
  2.146   2019   Ca      2010   9           0    0    0    0    0      0    0    0    1

As you can see, I am only interested in tracking the lead and lag for 4 time periods before the index and after the index year [column a].

  y       t      w       a     TimeToEvent  Lg4  Lg3  Lg2  Lg1  Index  Ld1  Ld2  Ld3  Ld4
  .       .      .       .      .           .    .    .    .    .      .    .    .    .  
  .       .      .       .      .           .    .    .    .    .      .    .    .    . 
  2.646   2011   Zi      2015  -4           1    0    0    0    0      0    0    0    0 
  0.133   2013   Zi      2015  -2           0    0    1    0    0      0    0    0    0 
 -2.354   2015   Zi      2015   0           0    0    0    0    1      0    0    0    0 
 -0.854   2016   Zi      2015   1           0    0    0    0    0      1    0    0    0
 -3.854   2017   Zi      2015   2           0    0    0    0    0      0    1    0    0
  .       .      .       .      .           .    .    .    .    .      .    .    .    .  
  .       .      .       .      .           .    .    .    .    .      .    .    .    . 
  1.147   2010   Ca      2010   0           0    0    0    0    1      0    0    0    0 
 -6.354   2011   Ca      2010   1           0    0    0    0    0      1    0    0    0

Any observations occurring outside the 4 lead or 4 lag periods are assigned 1 depending on whether they are t < a or t > a for example the first three observations time to event is -12,-17 and -14 so these observations will be assigned 1 in Lg4 column (boundary) similarly first 4 observations of group(wa)=Ca is -9,-8,-7,-5 so they are assigned 1 in the in Lg4 column

  y       t      w       a     TimeToEvent  Lg4  Lg3  Lg2  Lg1  Index  Ld1  Ld2  Ld3  Ld4
 -1.354   1994   Zi      2015  -21          1    0    0    0    0      0    0    0    0  
  2.146   1998   Zi      2015  -17          1    0    0    0    0      0    0    0    0  
 -3.146   2001   Zi      2015  -14          1    0    0    0    0      0    0    0    0 

 -2.354   2001   Ca      2010  -9           1    0    0    0    0      0    0    0    0 
  1.146   2002   Ca      2010  -8           1    0    0    0    0      0    0    0    0 
 -2.354   2003   Ca      2010  -7           1    0    0    0    0      0    0    0    0 
 -0.854   2005   Ca      2010  -5           1    0    0    0    0      0    0    0    0 

Observations with time to event outside lead will be assigned 1 on the other end( column Ld4)

  y       t      w       a     TimeToEvent  Lg4  Lg3  Lg2  Lg1  Index  Ld1  Ld2  Ld3  Ld4
 -2.354   2021   Zi      2015   6           0    0    0    0    0      0    0    0    1
  2.146   2019   Ca      2010   9           0    0    0    0    0      0    0    0    1

Final Expected dataset as shown in Step2 above

  y       t      w       a     TimeToEvent  Lg4  Lg3  Lg2  Lg1  Index  Ld1  Ld2  Ld3  Ld4
 -1.354   1994   Zi      2015  -21          1    0    0    0    0      0    0    0    0  
  2.146   1998   Zi      2015  -17          1    0    0    0    0      0    0    0    0  
 -3.146   2001   Zi      2015  -14          1    0    0    0    0      0    0    0    0 
 
  2.646   2011   Zi      2015  -4           1    0    0    0    0      0    0    0    0 
  0.133   2013   Zi      2015  -2           0    0    1    0    0      0    0    0    0 
 -2.354   2015   Zi      2015   0           0    0    0    0    1      0    0    0    0 
 -0.854   2016   Zi      2015   1           0    0    0    0    0      1    0    0    0
 -3.854   2017   Zi      2015   2           0    0    0    0    0      0    1    0    0
 
 -2.354   2021   Zi      2015   6           0    0    0    0    0      0    0    0    1
 
 -2.354   2001   Ca      2010  -9           1    0    0    0    0      0    0    0    0 
  1.146   2002   Ca      2010  -8           1    0    0    0    0      0    0    0    0 
 -2.354   2003   Ca      2010  -7           1    0    0    0    0      0    0    0    0 
 -0.854   2005   Ca      2010  -5           1    0    0    0    0      0    0    0    0 
 
  1.147   2010   Ca      2010   0           0    0    0    0    1      0    0    0    0 
 -6.354   2011   Ca      2010   1           0    0    0    0    0      1    0    0    0
 
  2.146   2019   Ca      2010   9           0    0    0    0    0      0    0    0    1

Any suggestions regarding this is much appreciated. Thanks.


Solution

  • It seems that you are only testing upon your time to event so a bunch of ifelse will do the trick.

    If you want to replace missing values there, I suggest you use dplyr::across() to apply tidyr::replace_na() to all those columns. You could also use dplyr::case_when() instead of ifelse().

    Also, for your future questions, please try to provide the dataset in the form of pasteable code such as in my answer (or even better using dput()).

    Here is the code:

    library(tidyverse)
    df=read.table(header=T, text="
    y       t      w       a
    -1.354   1994   Zi      2015
    2.146   1994   Zi      2015
    3.146   2001   Zi      2015
    2.646   2011   Zi      2015
    -2.354   2015   Zi      2015
    -0.854   2016   Zi      2015
    -3.854   2017   Zi      2015
    -2.354   2021   Zi      2015 
    -2.354   2001   Ca      2010 
    1.146   2002   Ca      2010 
    -2.354   2003   Ca      2010
    -0.854   2005   Ca      NA
    -6.354   2011   Ca      2010
    2.146   2019   Ca      2010")
    
    df %>% 
        mutate(
            tte=t-a, 
            Lg4=ifelse(tte<=-4, 1, 0), 
            Lg3=ifelse(tte==-3, 1, 0), 
            Lg2=ifelse(tte==-2, 1, 0), 
            Lg1=ifelse(tte==-1, 1, 0), 
            Index=ifelse(tte==0, 1, 0), 
            Ld1=ifelse(tte==1, 1, 0),
            Ld2=ifelse(tte==2, 1, 0), 
            Ld3=ifelse(tte==3, 1, 0), 
            Ld4=ifelse(tte>=4, 1, 0), 
            across(Lg4:Ld4, replace_na, replace=0)
        )
    #>         y    t  w    a tte Lg4 Lg3 Lg2 Lg1 Index Ld1 Ld2 Ld3 Ld4
    #> 1  -1.354 1994 Zi 2015 -21   1   0   0   0     0   0   0   0   0
    #> 2   2.146 1994 Zi 2015 -21   1   0   0   0     0   0   0   0   0
    #> 3   3.146 2001 Zi 2015 -14   1   0   0   0     0   0   0   0   0
    #> 4   2.646 2011 Zi 2015  -4   1   0   0   0     0   0   0   0   0
    #> 5  -2.354 2015 Zi 2015   0   0   0   0   0     1   0   0   0   0
    #> 6  -0.854 2016 Zi 2015   1   0   0   0   0     0   1   0   0   0
    #> 7  -3.854 2017 Zi 2015   2   0   0   0   0     0   0   1   0   0
    #> 8  -2.354 2021 Zi 2015   6   0   0   0   0     0   0   0   0   1
    #> 9  -2.354 2001 Ca 2010  -9   1   0   0   0     0   0   0   0   0
    #> 10  1.146 2002 Ca 2010  -8   1   0   0   0     0   0   0   0   0
    #> 11 -2.354 2003 Ca 2010  -7   1   0   0   0     0   0   0   0   0
    #> 12 -0.854 2005 Ca   NA   0   0   0   0   0     0   0   0   0   0
    #> 13 -6.354 2011 Ca 2010   1   0   0   0   0     0   1   0   0   0
    #> 14  2.146 2019 Ca 2010   9   0   0   0   0     0   0   0   0   1
    

    Created on 2021-12-27 by the reprex package (v2.0.1)