
Define new variable to take on 1 if next row of another variable fulfills condition


I'm trying to set up my dataset for event-history analysis, and for this I need to define a new column. My dataset has the following form:

ID   Var1
1    10
1    20  
1    30  
1    10
2    4
2    5
2    10
2    5
3    1
3    15
3    20
3    9
4    18
4    32
4    NA
4    12
5    2
5    NA
5    8
5    3

And I want to get to the following form:

ID   Var1   Var2
1    10      0
1    20      0
1    30      1
1    10      0
2    4       0
2    5       0
2    10      0
2    5       0
3    1       0
3    15      0
3    20      1
3    9       0
4    18      0
4    32      NA
4    NA      1
4    12      0
5    2       NA
5    NA      0
5    8       1
5    3       0

In words: I want the new variable to flag the row just before the value of Var1 drops below 50% of the maximum that Var1 reaches within its group (ID). Whether the last Var2 value in each group is NA or 0 is not really important, although NA would make more sense from a theoretical perspective. I've tried using something like

DF$Var2 <- df %>%
  group_by(ID) %>%
  ifelse(df == ave(df$Var1,df$ID, FUN = max), 0,1)

and then lagging the result by 1, but it returns an "unused argument" error for the 1 in ifelse.

Thanks for your solutions!


Solution

  • Here is a base R option via ave + cummax

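    # 1 on the row before Var1 falls below half the group maximum, once that maximum has been reached
    within(df, Var2 <- ave(Var1, ID, FUN = function(x) c((x < max(x)/2 & cummax(x) == max(x))[-1], 0)))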
    

    which gives

    > within(df,Var2 <- ave(Var1,ID,FUN = function(x) c((x<max(x)/2 & cummax(x)==max(x))[-1],0)))
       ID Var1 Var2
    1   1   10    0
    2   1   20    0
    3   1   30    1
    4   1   10    0
    5   2    4    0
    6   2    5    0
    7   2   10    0
    8   2    5    0
    9   3    1    0
    10  3   15    0
    11  3   20    1
    12  3    9    0
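
    To make the one-liner easier to follow, here is a rough walk-through of what the anonymous function does for a single group. This is only a sketch: the vector `x` is ID 1 from the question, and the names `below_half`, `after_peak`, `drop_here` and `var2` are made up for illustration.

    ```r
    # Hypothetical walk-through for one group (ID 1 from the question)
    x <- c(10, 20, 30, 10)

    below_half <- x < max(x) / 2           # TRUE FALSE FALSE TRUE
    after_peak <- cummax(x) == max(x)      # FALSE FALSE TRUE TRUE
    drop_here  <- below_half & after_peak  # FALSE FALSE FALSE TRUE

    # Shift the flags up by one row so the 1 sits on the row *before* the drop,
    # then pad the last row of the group with 0
    var2 <- c(drop_here[-1], 0)
    print(var2)
    ```

    The `cummax(x) == max(x)` part is what keeps early small values (like the first 10 here) from being flagged before the maximum has been reached.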
    

    Data

    > dput(df)
    structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 
    3L, 3L), Var1 = c(10L, 20L, 30L, 10L, 4L, 5L, 10L, 5L, 1L, 15L,
    20L, 9L)), class = "data.frame", row.names = c(NA, -12L))
    

    Edit (for updated post)

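    f <- function(v) {
      # NA on rows whose next value is missing, 0 elsewhere
      u1 <- c(replace(v,!is.na(v),0),0)[-1]
      # fill each NA with the preceding value (assumes an NA is never first in its group)
      v[is.na(v)] <- v[which(is.na(v))-1]
      # 1 on the row before the value drops below half the maximum, once the maximum is reached
      u2 <- c((v<max(v)/2 & cummax(v)==max(v))[-1],0)
      u1+u2
    }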
    
    within(df,Var2 <- ave(Var1,ID,FUN = f))
    

    such that

    > within(df,Var2 <- ave(Var1,ID,FUN = f))
       ID Var1 Var2
    1   1   10    0
    2   1   20    0
    3   1   30    1
    4   1   10    0
    5   2    4    0
    6   2    5    0
    7   2   10    0
    8   2    5    0
    9   3    1    0
    10  3   15    0
    11  3   20    1
    12  3    9    0
    13  4   18    0
    14  4   32   NA
    15  4   NA    1
    16  4   12    0
    17  5    2   NA
    18  5   NA    0
    19  5    8    1
    20  5    3    0
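
    For the NA case, it may help to see the intermediate vectors of `f` for one group. This is a sketch using ID 4 from the question; `filled` and `var2` are illustrative names, and it assumes an NA is never the first value of a group and that NAs do not occur back to back.

    ```r
    # Hypothetical walk-through of f for one group (ID 4 from the question)
    v <- c(18, 32, NA, 12)

    # 0 where Var1 is observed, NA where it is missing, shifted up by one row
    u1 <- c(replace(v, !is.na(v), 0), 0)[-1]

    # Carry the previous value forward into each NA
    # (assumption: the NA is not first in the group and NAs are not consecutive)
    filled <- v
    filled[is.na(filled)] <- filled[which(is.na(filled)) - 1]

    # Same drop-below-half flag as in the first version, computed on the filled vector
    u2 <- c((filled < max(filled) / 2 & cummax(filled) == max(filled))[-1], 0)

    # Adding u1 puts NA on the row whose next value is missing
    var2 <- u1 + u2
    print(var2)
    ```

    Here `u1` is 0 NA 0 0, `filled` is 18 32 32 12, `u2` is 0 0 1 0, and their sum gives 0 NA 1 0, matching rows 13 to 16 above.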
    

    Data

    df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L,
    3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L), Var1 = c(10L, 20L, 30L,
    10L, 4L, 5L, 10L, 5L, 1L, 15L, 20L, 9L, 18L, 32L, NA, 12L, 2L,
    NA, 8L, 3L)), class = "data.frame", row.names = c(NA, -20L))