Tags: r, dataframe, data-cleaning, data-transform

Modifying a column of a data frame based on some rules


Suppose that I have a data frame as follows:

idp<-sort(rep(c("A","B","C","D"),10))
a1<-c(1,1,1,2,3,4,3,4,2,2)
a2<-c(3,3,NA,NA,4,1,2,3,1,1)
a3<-c(NA,NA,1,1,2,2,4,NA,NA,1)
a4<-c(4,3,2,1,NA,NA,NA,1,2,3)
dat<-data.frame(idp,outcome=c(a1,a2,a3,a4))

In dat, idp identifies a person. Each of a1, ..., a4 holds one person's sequence of statuses, where 4 indicates "dying"/death.

For any idp, once a 4 occurs, all following values need to be set to 4. If an NA occurs, it should be replaced by the most recent non-NA state before it. Finally, if the sequence starts with NA, the leading NAs should take the first non-missing state that appears in the vector.
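
To make the rules concrete, here is a small worked example for person C. The expected vector is worked out by hand from the three rules above, so treat it as an illustration of the desired result rather than a solution; the names raw, expected and changed are only used for this sketch.

```r
# Person C's raw outcomes, copied from a3 in the question.
raw <- c(NA, NA, 1, 1, 2, 2, 4, NA, NA, 1)

# Result implied by the rules (worked out by hand, not computed):
#   positions 1-2: leading NAs take the first observed state (1)
#   positions 8-9: NAs after the 4 become 4
#   position 10:   the 1 after the 4 is overwritten with 4
expected <- c(1, 1, 1, 1, 2, 2, 4, 4, 4, 4)

# Positions where the cleaned sequence differs from the raw one.
changed <- which(is.na(raw) | raw != expected)
changed
```

Only positions 1, 2, 8, 9 and 10 change; every other value is kept as recorded.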


Solution

  • Split outcome by idp, fill the NAs, then set every value from the first 4 onward to 4:

    #split per idp and loop
    l <- lapply(split(dat$outcome, dat$idp), function(i){
      # fill NA: carry the last value forward, then fill leading NAs backward
      out <- zoo::na.locf(zoo::na.locf(i, na.rm = FALSE),
                          na.rm = FALSE, fromLast = TRUE)
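      # fill 4 from its first occurrence onward, if there is one
      # (which() returns integer(0) when no 4 is present, so the check is safe)
      ix4 <- which(out == 4)
      if(length(ix4) > 0){ out[ ix4[1]:length(out) ] <- 4 }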
      
      out
      })
    
    
    l
    # $A
    # [1] 1 1 1 2 3 4 4 4 4 4
    # 
    # $B
    # [1] 3 3 3 3 4 4 4 4 4 4
    # 
    # $C
    # [1] 1 1 1 1 2 2 4 4 4 4
    # 
    # $D
    # [1] 4 4 4 4 4 4 4 4 4 4
    

    Convert back to a data frame. This relies on dat already being sorted by idp, since split() returns the groups in sorted idp order:

    head(cbind(dat, outcomeNew = unlist(l, use.names = FALSE)), 10)
    #    idp outcome outcomeNew
    # 1    A       1          1
    # 2    A       1          1
    # 3    A       1          1
    # 4    A       2          2
    # 5    A       3          3
    # 6    A       4          4
    # 7    A       3          4
    # 8    A       4          4
    # 9    A       2          4
    # 10   A       2          4
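
If adding the zoo dependency is not wanted, or the rows are not guaranteed to be sorted by idp, something along these lines might work in base R. This is only a sketch: fix_status is a hypothetical helper name, and the cummax() index trick is swapped in for zoo::na.locf. ave() writes each group's result back into the original row positions, so no unlist() step is needed.

```r
# Data from the question.
idp <- sort(rep(c("A", "B", "C", "D"), 10))
a1 <- c(1, 1, 1, 2, 3, 4, 3, 4, 2, 2)
a2 <- c(3, 3, NA, NA, 4, 1, 2, 3, 1, 1)
a3 <- c(NA, NA, 1, 1, 2, 2, 4, NA, NA, 1)
a4 <- c(4, 3, 2, 1, NA, NA, NA, 1, 2, 3)
dat <- data.frame(idp, outcome = c(a1, a2, a3, a4))

# Hypothetical helper: applies the three rules to one person's sequence.
fix_status <- function(x) {
  ok <- !is.na(x)
  if (!any(ok)) return(x)            # all NA: nothing to fill from
  # For each position, index of the most recent non-NA value (0 if none yet).
  idx <- cummax(seq_along(x) * ok)
  # Leading NAs take the first observed value.
  idx[idx == 0] <- which(ok)[1]
  out <- x[idx]
  # Once a 4 has been seen, every later value becomes 4.
  out[cumsum(out == 4) > 0] <- 4
  out
}

# Apply per person; results stay aligned with the original rows.
dat$outcomeNew <- ave(dat$outcome, dat$idp, FUN = fix_status)
head(dat, 10)
```

Because the helper works on a plain vector, it can also be tried on a single sequence first, for example fix_status(a3), before applying it to the whole data frame.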