Search code examples
rpurrrintervalslubridate

How to most efficiently calculate daily enrollment from an entry and exit date in R?


The following code works, but it seems highly inefficient. Is there a more straight forward way to calculate a daily enrollment by site from an entry and exit date.

Data:

df <- data.frame(
id <- seq_along(1:10),
entry_date <- seq(as.Date("2022-01-01"), as.Date("2022-01-10"), by = 1),
exit_date <- seq(as.Date("2022-05-20"), as.Date("2022-05-30"), by = 1),
site <- rep(c("Loc1", "Loc2"), times= 5)
) %>% 
  set_names("id", "entry_date", "exit_date", "site") %>% 
  mutate(exit_date = as.character(exit_date))

df[8:10, 3] <- rep("NA", times = 3)

Create a ytd sequence:

date <- seq(as.Date("2022-01-01"), today(), by = 1) 

And iterate the function below over it with:

enrolled_ytd <- map_dfr(date, ~ daily_enrollment_fun(.), .id = "date") 

Function:

daily_enrollment_fun <- function(date){
  df %>% 
    select(id, entry_date, exit_date, site) %>% 
    drop_na(site) %>%
    mutate(enrolled_int = interval((entry_date), (exit_date))) %>% 
    distinct(id, .keep_all = T) %>% 
    mutate(enrolled = date %within% enrolled_int) 
}

The output is a dataframe with a TRUE/FALSE enrollment for every id and every day in the sequence of dates. To clean the data, I run the function:

daily_enrollment_clean_fun <- function(date, origin = "2022-01-01") {
  
  enrolled_ytd  %>% 
    mutate(date = as.numeric(date) -1,
           date = as.Date(date, origin = origin)) %>%
    group_by(date, site, enrolled) %>%
    count() %>%
    filter(enrolled == "TRUE") %>%
    ungroup() %>%
    select(-enrolled) %>% 
    arrange(desc(date))
}

Solution

  • Does this give you what you're looking for? I expect this should be more performant since the calculation here is vectorized once we get the stream of entries and exits into a longer form.

    library(tidyverse)
    df %>%
      pivot_longer(entry_date:exit_date) %>%
      filter(!is.na(value)) %>%
      mutate(change = if_else(name == "entry_date", 1, -1)) %>%
      group_by(site) %>%
      arrange(site, value) %>%
      mutate(enrollment = cumsum(change))  %>%
      complete(value = seq.Date(min(value, na.rm = TRUE), 
                                      max(value, na.rm = TRUE), 
                                      by = "day")) %>%
      fill(enrollment) %>%
      ungroup() 
    

    Result into ggplot with %>% ggplot(aes(value, enrollment, color = site)) + geom_point()

    enter image description here


    Modified input (to keep both date columns as dates)

    library(tidyverse)
    df <- data.frame(
      id <- seq_along(1:10),
      entry_date <- seq(as.Date("2022-01-01"), as.Date("2022-01-10"), by = 1),
      exit_date <- seq(as.Date("2022-05-21"), as.Date("2022-05-30"), by = 1),
      site <- rep(c("Loc1", "Loc2"), times= 5)
    ) %>% 
      set_names("id", "entry_date", "exit_date", "site")
    
    df[8:10, 3] <- rep(NA_real_, times = 3)