I have a short question about computing the number of participants by timepoints. Consider the sample long format data:
data<-data.frame(id=c(1,1,1,1,2,2,2,3,3,3,3),survey_date=c("01/12/2020","02/12/2020","03/12/2020","04/12/2020","01/31/2020","03/12/2020","04/05/2020","02/12/2020","04/12/2020","05/12/2020","06/12/2020"),last_seen=c('05/25/2020','05/25/2020','05/25/2020','05/25/2020','04/06/2020','04/06/2020','04/06/2020','','','',''))
Survey date indicates when the survey took place. \Some subjects in the data became lost to follow-up, we do have that last-seen date recorded in the data for those subjects (it appears in all rows for that subject).
I would like to add a column 'num_N' to the existing data indicating the number of participants who were in the study at the timepoint survey_date.
For example, on 01/12/2020, only subject id=1 was in the study, so num_N for that row would be 1.
On 06/12/2020, only subject id=3 was in the study, so num_N for that row would be 1.
Any help would be appreciated. Thanks!
This is a good case for iv_count_between
from ivs
:
library(dplyr)
library(ivs)
#Convert your dates to date format
data <- data |>
mutate(across(-id, lubridate::mdy))
#Create intervals
data_ivs <-
data |>
summarise(min = min(survey_date),
max = max(survey_date, last_seen, na.rm = TRUE),
.by = id) |>
mutate(ivs = iv(min, max + 1))
#Count intervals with `iv_count_between`:
data |>
mutate(num_N = iv_count_between(survey_date, data_ivs$ivs))
# id survey_date last_seen num_N
# 1 1 2020-01-12 2020-05-25 1
# 2 1 2020-02-12 2020-05-25 3
# 3 1 2020-03-12 2020-05-25 3
# 4 1 2020-04-12 2020-05-25 2
# 5 2 2020-01-31 2020-04-06 2
# 6 2 2020-03-12 2020-04-06 3
# 7 2 2020-04-05 2020-04-06 3
# 8 3 2020-02-12 <NA> 3
# 9 3 2020-04-12 <NA> 2
# 10 3 2020-05-12 <NA> 2
# 11 3 2020-06-12 <NA> 1