Suppose that I have a data frame as follows:
idp<-sort(rep(c("A","B","C","D"),10))
a1<-c(1,1,1,2,3,4,3,4,2,2)
a2<-c(3,3,NA,NA,4,1,2,3,1,1)
a3<-c(NA,NA,1,1,2,2,4,NA,NA,1)
a4<-c(4,3,2,1,NA,NA,NA,1,2,3)
dat<-data.frame(idp,outcome=c(a1,a2,a3,a4))
In dat, idp identifies a person. Each value of a1, ..., a4 represents a status, where 4 indicates "dying"/death.
For any idp, once a 4 occurs, all following values need to be set to 4. If an NA occurs, it should be replaced by the most recent non-NA state. Finally, if the sequence starts with NA, the leading NAs should take the first non-missing state that appears in the vector.
Split the outcome by idp, fill the NAs, then, if a 4 occurs, set every value from the first 4 onward to 4:
# split per idp and loop
l <- lapply(split(dat$outcome, dat$idp), function(i){
  # fill NA: carry the last state forward, then fill leading NAs backward
  out <- zoo::na.locf(zoo::na.locf(i, na.rm = FALSE),
                      na.rm = FALSE, fromLast = TRUE)
  # fill 4 if any: which() returns integer(0) when there is no 4
  ix4 <- which(out == 4)
  if(length(ix4) > 0){ out[ min(ix4):length(out) ] <- 4 }
  out
})
l
# $A
# [1] 1 1 1 2 3 4 4 4 4 4
#
# $B
# [1] 3 3 3 3 4 4 4 4 4 4
#
# $C
# [1] 1 1 1 1 2 2 4 4 4 4
#
# $D
# [1] 4 4 4 4 4 4 4 4 4 4
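All four example sequences contain a 4, so the "if any" case is never exercised with an empty match. Below is a minimal base R sketch of how a sequence without a 4 behaves. The vector x is hypothetical and not part of the original data.

```r
x <- c(1, 2, NA, 3)      # hypothetical sequence that never reaches state 4
ix <- which(x == 4)      # no match, so this is integer(0)
n_hits <- length(ix)     # 0

# min(ix) on an empty vector would give Inf with a warning,
# so test the length of the which() result before taking the minimum
if (n_hits > 0) x[min(ix):length(x)] <- 4

x                        # left unchanged because there is no 4
```

The point is only that the length check needs to be applied to the result of which() itself, before min() is taken.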
Convert back to a data frame (this relies on dat already being sorted by idp, because split() groups the values by idp):
head(cbind(dat, outcomeNew = unlist(l, use.names = FALSE)), 10)
# idp outcome outcomeNew
# 1 A 1 1
# 2 A 1 1
# 3 A 1 1
# 4 A 2 2
# 5 A 3 3
# 6 A 4 4
# 7 A 3 4
# 8 A 4 4
# 9 A 2 4
# 10 A 2 4
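If the rows were not sorted by idp, one possible variant is to compute the new column with ave(), which returns results in the original row order. The sketch below uses base R only, as an alternative to zoo::na.locf. The helper name fill_states is hypothetical and not taken from the answer above.

```r
# Rebuild the example data from the question
idp <- sort(rep(c("A", "B", "C", "D"), 10))
a1 <- c(1, 1, 1, 2, 3, 4, 3, 4, 2, 2)
a2 <- c(3, 3, NA, NA, 4, 1, 2, 3, 1, 1)
a3 <- c(NA, NA, 1, 1, 2, 2, 4, NA, NA, 1)
a4 <- c(4, 3, 2, 1, NA, NA, NA, 1, 2, 3)
dat <- data.frame(idp, outcome = c(a1, a2, a3, a4))

# Hypothetical helper: applies the three rules to one person's sequence
fill_states <- function(x) {
  ok <- !is.na(x)
  if (!any(ok)) return(x)              # all NA: nothing to fill from
  # position of the most recent non-NA value (0 while still in leading NAs)
  idx <- cummax(seq_along(x) * ok)
  idx[idx == 0] <- which(ok)[1]        # leading NAs take the first observed state
  out <- x[idx]
  out[cumsum(out == 4) > 0] <- 4       # once a 4 occurs, everything after is 4
  out
}

# ave() applies the helper within each idp and keeps the original row order
dat$outcomeNew <- ave(dat$outcome, dat$idp, FUN = fill_states)
```

On the example data this should give the same outcomeNew column as the split/lapply approach, without the cbind step.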