I'm trying to set up my dataset for event-history analysis, and for this I need to define a new column. My dataset has the following form:
ID Var1
1 10
1 20
1 30
1 10
2 4
2 5
2 10
2 5
3 1
3 15
3 20
3 9
4 18
4 32
4 NA
4 12
5 2
5 NA
5 8
5 3
And I want to get to the following form:
ID Var1 Var2
1 10 0
1 20 0
1 30 1
1 10 0
2 4 0
2 5 0
2 10 0
2 5 0
3 1 0
3 15 0
3 20 1
3 9 0
4 18 0
4 32 NA
4 NA 1
4 12 0
5 2 NA
5 NA 0
5 8 1
5 3 0
In words: I want the new variable to indicate whether the value of Var1
drops (within its group) below 50% of the maximum value that Var1
reaches for that group. Whether the last value is NA or 0 is not really important, although NA
would make more sense from a theoretical perspective.
I've tried using something like
DF$Var2 <- df %>%
group_by(ID) %>%
ifelse(df == ave(df$Var1,df$ID, FUN = max), 0,1)
to then lag it by 1, but it returns an error on an unused argument 1 in ifelse.
Thanks for your solutions!
Here is a base R option via ave + cummax:
within(df,Var2 <- ave(Var1,ID,FUN = function(x) c((x<max(x)/2 & cummax(x)==max(x))[-1],0)))
which gives
> within(df,Var2 <- ave(Var1,ID,FUN = function(x) c((x<max(x)/2 & cummax(x)==max(x))[-1],0)))
ID Var1 Var2
1 1 10 0
2 1 20 0
3 1 30 1
4 1 10 0
5 2 4 0
6 2 5 0
7 2 10 0
8 2 5 0
9 3 1 0
10 3 15 0
11 3 20 1
12 3 9 0
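To make the one-liner easier to follow, here is a minimal sketch of the same logic unrolled for a single group (the Var1 values of ID 1). The intermediate names below_half, peak_passed and drop are only illustrative and do not appear in the code above.

```r
# Var1 values of one group (ID 1 in the example data)
x <- c(10, 20, 30, 10)

# TRUE where the value is below half of the group maximum
below_half <- x < max(x) / 2

# TRUE from the row where the group maximum is reached onwards
peak_passed <- cummax(x) == max(x)

# A drop only counts once the maximum has already been reached
drop <- below_half & peak_passed

# Shift the flag one row up so it marks the row before the drop,
# and pad the last row with 0 (it has no following observation)
var2 <- c(drop[-1], 0)
var2
# [1] 0 0 1 0
```

ave(Var1, ID, FUN = ...) then applies this per ID and puts the results back in the original row order.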
Data
> dput(df)
structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L), Var1 = c(10L, 20L, 30L, 10L, 4L, 5L, 10L, 5L, 1L, 15L,
20L, 9L)), class = "data.frame", row.names = c(NA, -12L))
Edit (for updated post)
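f <- function(v) {
  # NA on the row before a missing value, 0 elsewhere
  u1 <- c(replace(v,!is.na(v),0),0)[-1]
  # fill each NA with the preceding value
  v[is.na(v)] <- v[which(is.na(v))-1]
  # 1 on the row before the value drops below half of the group maximum
  u2 <- c((v<max(v)/2 & cummax(v)==max(v))[-1],0)
  u1+u2
}
within(df,Var2 <- ave(Var1,ID,FUN = f))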
such that
> within(df,Var2 <- ave(Var1,ID,FUN = f))
ID Var1 Var2
1 1 10 0
2 1 20 0
3 1 30 1
4 1 10 0
5 2 4 0
6 2 5 0
7 2 10 0
8 2 5 0
9 3 1 0
10 3 15 0
11 3 20 1
12 3 9 0
13 4 18 0
14 4 32 NA
15 4 NA 1
16 4 12 0
17 5 2 NA
18 5 NA 0
19 5 8 1
20 5 3 0
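As a rough walk-through of what f does with a missing value, here are its steps applied by hand to the Var1 values of ID 4. The name v_filled is only used for illustration. Note that this approach, as far as I can tell, relies on a group not starting with NA and not containing consecutive NAs, since each NA is filled from the row directly before it.

```r
# Var1 values of one group with a missing value (ID 4 in the example data)
v <- c(18, 32, NA, 12)

# Step 1: NA on the row *before* a missing value, 0 everywhere else
u1 <- c(replace(v, !is.na(v), 0), 0)[-1]

# Step 2: fill each NA with the preceding value so max() and cummax() work
v_filled <- v
v_filled[is.na(v)] <- v[which(is.na(v)) - 1]

# Step 3: same drop indicator as in the NA-free version, shifted one row up
u2 <- c((v_filled < max(v_filled) / 2 &
           cummax(v_filled) == max(v_filled))[-1], 0)

# Step 4: adding both keeps the NA from u1 and the 1 from u2
var2 <- u1 + u2
var2
# [1]  0 NA  1  0
```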
Data
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L), Var1 = c(10L, 20L, 30L,
10L, 4L, 5L, 10L, 5L, 1L, 15L, 20L, 9L, 18L, 32L, NA, 12L, 2L,
NA, 8L, 3L)), class = "data.frame", row.names = c(NA, -20L))