I have a dataset which has studyid, year, and two flags: incident and prevalent. I wish the prevalent variable to be TRUE (1) for all years after the incident flag is true (and the incident variable can be true only once). case_when and lag seem the perfect combination, but if incident is set to 1 in year N, prevalent is set to TRUE only in N+1, and reverse to 0 in N+1. This was not the expected behaviour.
Here is sample code:
library(tidyverse)
# make a fake dataset
testdat <- tribble(
~studyid, ~datestring, ~incident,
"1", "2000-01-01", 0,
"1", "2001-01-01", 1,
"1", "2002-01-01", 0,
"1", "2003-01-01", 0,
"2", "2003-01-01", 0,
"2", "2004-01-01", 1,
"2", "2005-01-01", 0,
"2", "2006-01-01", 0
) %>% mutate(
prevalent = 0,
date = lubridate::ymd(datestring)
) %>% group_by(studyid) %>%
arrange(studyid, date) %>%
mutate(prevalent = case_when(
#logic is, if prevalent in year N-1, the prevalent in year N
# if incident in year N-1, then prevalent in year N
# otherwise not prevalent (because never incident)
dplyr::lag(prevalent, 1L)==1 ~1,
dplyr::lag(incident, 1L)==1 ~1,
TRUE ~ 0
) #close case_when
) #close mutate
testdat
Output is:
# A tibble: 8 x 5
# Groups: studyid [2]
studyid datestring incident prevalent date
<chr> <chr> <dbl> <dbl> <date>
1 1 2000-01-01 0 0 2000-01-01
2 1 2001-01-01 1 0 2001-01-01
3 1 2002-01-01 0 1 2002-01-01
4 1 2003-01-01 0 0 2003-01-01
5 2 2003-01-01 0 0 2003-01-01
6 2 2004-01-01 1 0 2004-01-01
7 2 2005-01-01 0 1 2005-01-01
8 2 2006-01-01 0 0 2006-01-01
>
Desired output is:
studyid=1, year=2003 prevalent ==1 (not 0)
studyid=2, year=2006 prevalent ==1 (not 0)
I suspect this has to do with how case_when is interacting with dplyr::lag. How can I improve the logic/syntax to obtain the needed results?
Many thanks,
You are looking for something like the last observation carried forward e.g zoo::na.locf
or tidyr::fill
, but I'll use something simple like:
library(dplyr)
testdat %>%
mutate(date = lubridate::ymd(datestring)) %>% group_by(studyid) %>%
arrange(studyid, date) %>% mutate(prevalent=cumsum(lag(incident,default = 0)==1))
# A tibble: 8 x 5
# Groups: studyid [2]
studyid datestring incident date prevalent
<chr> <chr> <dbl> <date> <int>
1 1 2000-01-01 0 2000-01-01 0
2 1 2001-01-01 1 2001-01-01 0
3 1 2002-01-01 0 2002-01-01 1
4 1 2003-01-01 0 2003-01-01 1
5 2 2003-01-01 0 2003-01-01 0
6 2 2004-01-01 1 2004-01-01 0
7 2 2005-01-01 0 2005-01-01 1
8 2 2006-01-01 0 2006-01-01 1