I have data of format
set.seed(40)
subject <- sample(c("mike", "john", "steve"), 20, replace = TRUE)
test1 <- sample(c("pos", "neg", "pos", "neg", "NA"), 20, replace = TRUE)
testdate <- Sys.Date() + sample(-1000:1000, 20, replace = FALSE)
mydf <- data.frame(subject, testdate, test1)
mydf$status <- "unknown"
For each subject, I would like to modify the values of status such that: a) it remains unknown until the earliest (by test date) pos or neg result for test1 is obtained; b) when the first pos test1 result is obtained, status on that date and afterwards becomes “in” regardless of any subsequent test1 values; c) if a neg result occurs for test1 before any positive result, status becomes “out” for that and subsequent dates until any positive test1 result is obtained. Open to all solutions. I am experimenting with dplyr and would be especially interested in a dplyr-based solution.
The output would be
subject testdate test1 status
john 2014-11-20 neg negative
john 2015-07-29 neg negative
john 2015-11-10 neg negative
john 2017-04-08 neg negative
john 2018-09-18 NA negative
mike 2014-09-01 pos positive
mike 2014-10-14 neg positive
mike 2015-03-22 neg positive
mike 2016-09-15 pos positive
mike 2017-08-18 neg positive
mike 2017-12-20 pos positive
mike 2018-09-06 NA positive
mike 2019-09-02 neg positive
steve 2015-06-21 neg negative
steve 2016-01-03 pos positive
steve 2016-03-12 neg positive
steve 2017-06-26 neg positive
steve 2017-12-02 neg positive
steve 2018-12-20 pos positive
steve 2019-06-20 pos positive
This is relatively straightforward with group_by
and mutate
.
First, modify the test results to be a factor. This allows them to be "ranked" so that we can tell what the "highest" result has been. Because you want the results to be "Missing", "Negative", "Positive", set the levels in that order:
mydf$test1 <-
factor(mydf$test1
, levels = c("NA", "neg", "pos")
, ordered = TRUE)
Next, create a vector of the labels you want to use when each of the above has happened. In the text, you said you wanted in/out, but the desired output uses Negative/Positive. If you want to change the labels, it should be easy to do that here:
statusLevels <-
c("Unknown", "Negative", "Positive")
Finally, we can apply this to the data. First, sort by date to ensure that the test results are checked in the right order (I am also sorting by subject to make the results clear and match your request). Then, group by the the subject. Finally, mutate
to create the column you want. Here, it checks for the "biggest" test value so far (hence why we converted to a factor) and gives us the status level that matches:
mydf %>%
arrange(subject, testdate) %>%
group_by(subject) %>%
mutate(status = statusLevels[cummax(as.numeric(test1))])
Returns:
subject testdate test1 status
<chr> <date> <ord> <chr>
1 john 2014-11-21 neg Negative
2 john 2015-07-30 neg Negative
3 john 2015-11-11 neg Negative
4 john 2017-04-09 neg Negative
5 john 2018-09-19 NA Negative
6 mike 2014-09-02 pos Positive
7 mike 2014-10-15 neg Positive
8 mike 2015-03-23 neg Positive
9 mike 2016-09-16 pos Positive
10 mike 2017-08-19 neg Positive
11 mike 2017-12-21 pos Positive
12 mike 2018-09-07 NA Positive
13 mike 2019-09-03 neg Positive
14 steve 2015-06-22 neg Negative
15 steve 2016-01-04 pos Positive
16 steve 2016-03-13 neg Positive
17 steve 2017-06-27 neg Positive
18 steve 2017-12-03 neg Positive
19 steve 2018-12-21 pos Positive
20 steve 2019-06-21 pos Positive