r dataframe data-cleaning data-transform

Modifying a column of a data frame based on some rules

Suppose that I have a data frame as follows:

idp<-sort(rep(c("A","B","C","D"),10))
a1<-c(1,1,1,2,3,4,3,4,2,2)
a2<-c(3,3,NA,NA,4,1,2,3,1,1)
a3<-c(NA,NA,1,1,2,2,4,NA,NA,1)
a4<-c(4,3,2,1,NA,NA,NA,1,2,3)
dat<-data.frame(idp,outcome=c(a1,a2,a3,a4))

In dat, idp identifies a person. Each of the vectors a1, ..., a4 holds the sequence of statuses for one person, where 4 indicates "dying"/death.

For any idp, once a 4 occurs, all following values need to be set to 4. If an NA occurs, it should be replaced by the most recent non-NA state before it. Finally, if the sequence starts with NA, the leading NAs should take the first non-missing state that appears in the vector.

Solution

Split by idp, fill the NAs, then find the first 4 (if any) and set everything from there on to 4:

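# split outcome per idp and process each person's vector
l <- lapply(split(dat$outcome, dat$idp), function(i){
  # fill NAs: carry the last observation forward, then fill any
  # leading NAs backward from the first non-missing value
  out <- zoo::na.locf(zoo::na.locf(i, na.rm = FALSE),
                      na.rm = FALSE, fromLast = TRUE)
  # position(s) of 4; empty if this person never reaches 4
  # (min(which(...)) would return Inf here and break the indexing)
  ix4 <- which(out == 4)
  # from the first 4 onward, set everything to 4
  if(length(ix4) > 0){ out[ ix4[1]:length(out) ] <- 4 }

  out
  })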


l
# $A
# [1] 1 1 1 2 3 4 4 4 4 4
# 
# $B
# [1] 3 3 3 3 4 4 4 4 4 4
# 
# $C
# [1] 1 1 1 1 2 2 4 4 4 4
# 
# $D
# [1] 4 4 4 4 4 4 4 4 4 4
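
The same rules can also be written without a package dependency. The following is only a minimal base-R sketch, not the approach used above: fix_status is a hypothetical helper name, and it assumes the input is one person's status vector in time order.

```r
# Hypothetical helper: apply the three rules to one person's status vector.
fix_status <- function(x) {
  seen <- cumsum(!is.na(x))           # count of non-NA values seen so far
  vals <- x[!is.na(x)]                # the non-NA values, in order
  if (length(vals) == 0) return(x)    # all NA: nothing to fill from
  seen[seen == 0] <- 1                # leading NAs borrow the first non-NA value
  out <- vals[seen]                   # last observation carried forward
  out[cumsum(out == 4) > 0] <- 4      # once a 4 has occurred, stay at 4
  out
}

fix_status(c(NA, NA, 1, 1, 2, 2, 4, NA, NA, 1))
# [1] 1 1 1 1 2 2 4 4 4 4
fix_status(c(1, NA, 2, 3))            # no 4 present: only the NA is filled
# [1] 1 1 2 3
```

Because cumsum(out == 4) > 0 is simply all FALSE when no 4 is present, this version needs no special case for people who never reach state 4.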

Convert back to a data frame (unlist(l) lines up with the rows of dat only because dat is already sorted by idp):

head(cbind(dat, outcomeNew = unlist(l, use.names = FALSE)), 10)
#    idp outcome outcomeNew
# 1    A       1          1
# 2    A       1          1
# 3    A       1          1
# 4    A       2          2
# 5    A       3          3
# 6    A       4          4
# 7    A       3          4
# 8    A       4          4
# 9    A       2          4
# 10   A       2          4
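
If the rows are not sorted by idp, binding unlist(l) back onto the data frame would misalign the values. One possible way around this, sketched here under the assumption that rows are in time order within each person, is base R's ave(), which applies a function per group and returns the results in the original row positions. The names fix_status and dat2 are made up for this illustration.

```r
# Hypothetical helper: fill NAs (forward, then leading ones from the first
# non-NA value) and make state 4 absorbing.
fix_status <- function(x) {
  seen <- cumsum(!is.na(x))
  vals <- x[!is.na(x)]
  if (length(vals) == 0) return(x)    # all NA: leave untouched
  seen[seen == 0] <- 1
  out <- vals[seen]
  out[cumsum(out == 4) > 0] <- 4
  out
}

# Toy data with the two persons' rows interleaved
dat2 <- data.frame(
  idp     = c("B", "A", "B", "A", "B", "A"),
  outcome = c(NA, 1, 4, NA, 2, 4)
)

# ave() splits outcome by idp, applies fix_status to each piece,
# and puts every result back in its original row
dat2$outcomeNew <- ave(dat2$outcome, dat2$idp, FUN = fix_status)
dat2
#   idp outcome outcomeNew
# 1   B      NA          4
# 2   A       1          1
# 3   B       4          4
# 4   A      NA          1
# 5   B       2          4
# 6   A       4          4
```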