Search code examples
dplyrsplit-apply-combine

Assign column value (split apply combine) based on another column


I have data of format

set.seed(40)
subject <- sample(c("mike", "john", "steve"), 20, replace = TRUE)
test1 <- sample(c("pos", "neg", "pos", "neg", "NA"), 20, replace = TRUE)
testdate <- Sys.Date() + sample(-1000:1000, 20, replace = FALSE)
mydf <- data.frame(subject, testdate, test1)
mydf$status <- "unknown"

For each subject, I would like to modify the values of status such that: a) it remains unknown until the earliest (by test date) pos or neg result for test1 is obtained; b) when the first pos test1 result is obtained, status on that date and afterwards becomes “in” regardless of any subsequent test1 values; c) if a neg result occurs for test1 before any positive result, status becomes “out” for that and subsequent dates until any positive test1 result is obtained. Open to all solutions. I am experimenting with dplyr and would be especially interested in a dplyr-based solution.

The output would be

subject testdate    test1   status
john    2014-11-20  neg negative
john    2015-07-29  neg negative
john    2015-11-10  neg negative
john    2017-04-08  neg negative
john    2018-09-18  NA  negative
mike    2014-09-01  pos positive
mike    2014-10-14  neg positive
mike    2015-03-22  neg positive
mike    2016-09-15  pos positive
mike    2017-08-18  neg positive
mike    2017-12-20  pos positive
mike    2018-09-06  NA  positive
mike    2019-09-02  neg positive
steve   2015-06-21  neg negative
steve   2016-01-03  pos positive
steve   2016-03-12  neg positive
steve   2017-06-26  neg positive
steve   2017-12-02  neg positive
steve   2018-12-20  pos positive
steve   2019-06-20  pos positive

Solution

  • This is relatively straightforward with group_by and mutate.

    First, modify the test results to be a factor. This allows them to be "ranked" so that we can tell what the "highest" result has been. Because you want the results to be "Missing", "Negative", "Positive", set the levels in that order:

    mydf$test1 <-
      factor(mydf$test1
             , levels = c("NA", "neg", "pos")
             , ordered = TRUE)
    

    Next, create a vector of the labels you want to use when each of the above has happened. In the text, you said you wanted in/out, but the desired output uses Negative/Positive. If you want to change the labels, it should be easy to do that here:

    statusLevels <-
      c("Unknown", "Negative", "Positive")
    

    Finally, we can apply this to the data. First, sort by date to ensure that the test results are checked in the right order (I am also sorting by subject to make the results clear and match your request). Then, group by the the subject. Finally, mutate to create the column you want. Here, it checks for the "biggest" test value so far (hence why we converted to a factor) and gives us the status level that matches:

    mydf %>%
      arrange(subject, testdate) %>%
      group_by(subject) %>%
      mutate(status = statusLevels[cummax(as.numeric(test1))])
    

    Returns:

       subject   testdate test1   status
         <chr>     <date> <ord>    <chr>
    1     john 2014-11-21   neg Negative
    2     john 2015-07-30   neg Negative
    3     john 2015-11-11   neg Negative
    4     john 2017-04-09   neg Negative
    5     john 2018-09-19    NA Negative
    6     mike 2014-09-02   pos Positive
    7     mike 2014-10-15   neg Positive
    8     mike 2015-03-23   neg Positive
    9     mike 2016-09-16   pos Positive
    10    mike 2017-08-19   neg Positive
    11    mike 2017-12-21   pos Positive
    12    mike 2018-09-07    NA Positive
    13    mike 2019-09-03   neg Positive
    14   steve 2015-06-22   neg Negative
    15   steve 2016-01-04   pos Positive
    16   steve 2016-03-13   neg Positive
    17   steve 2017-06-27   neg Positive
    18   steve 2017-12-03   neg Positive
    19   steve 2018-12-21   pos Positive
    20   steve 2019-06-21   pos Positive