Search code examples
rdatelag

Count future events with a time cut-off


I have a dataframe with columns 'Person', 'NoShow', and 'date'.

Code to create the dataframe:

df <- data.frame(Person = c('334','334','334','334','334','334','334','334','334'), 
                 NoShow = c(1,1,1,0,1,1,0,1,0), 
                 date = c('2014-01-01','2016-06-01','2017-01-01','2018-05-01','2021-01-01','2021-06-03','2022-04-14','2022-04-14','2023-05-03'),
                 date_num = c(20140101, 20160601, 20170101, 20180501, 20210101, 20210603,20220414,20220414,20230503)
)

I want to create two new variables: One variable to show whether or not there was a no-show within two years of the current date, and one variable that counts the number of no-shows within two years of the current date.

The code below successfully creates the first variable 'reNoShow_2y' (whether or not there was a no-show within two years of the current date):

df$date_num_rev <- df$date_num * (-1)

df <- df[order(df$date_num_rev),]


### create reNoShow (0,1) and reNoShow_f (# of re NoShows)
library(tidyverse)

df <- df %>%
  group_by(Person) %>%
  mutate(reNoShow = lag(as.numeric(cumsum(NoShow) > 0)),
         reNoShow_f = lag(cumsum(NoShow), default = 0)*(date != lag(date)),
         reNoShow_f = case_when(
           is.na(reNoShow_f) ~ 0,
           reNoShow_f == 0 ~ lag(reNoShow_f),
           TRUE ~ reNoShow_f
         ))


### to prepare for the next step, convert reNoShow_f==NA to reNoShow==0
df$reNoShow_f[is.na(df$reNoShow_f)] <- 0

### create NoShowiable that captures the date of the most recent NoShow
df <- df %>%
  group_by(Person) %>%
  mutate(nextNoShow=ifelse(c(FALSE, diff(reNoShow_f))!=0, lag(date),NA)) %>%
  fill(nextNoShow)



df$daysto_nextNoShow <- as.numeric(difftime(df$nextNoShow, df$date, units = c('days')))

df$reNoShow_2y <- ifelse(df$reNoShow==1 & df$daysto_nextNoShow >= 730 & df$date!=df$nextNoShow,0,1)

View(df)

I don't know how to create the second variable (the number of no-shows within two years of the current date). From rows 1-9, the values of that variable should be: 0,0,0,1,2,0,0,1,0


Solution

  • This should calculate the number of no-shows (summing the NoShow column) for the period from the next day to two years' hence.

    library(lubridate); library(dplyr); library(slider)
    df %>%
      mutate(date = ymd(date)) %>%
      group_by(Person) %>%
      arrange(Person, date) %>% # slider unfortunately requires dates be sorted
      mutate(no_shows_next2yr = slide_index_dbl(
        NoShow, date, sum, .before = days(-1), .after = years(2)
      ))
    
      Person NoShow date       date_num no_shows_next2yr
      <chr>   <dbl> <date>        <dbl>            <dbl>
    1 334         1 2014-01-01 20140101                0
    2 334         1 2016-06-01 20160601                1
    3 334         1 2017-01-01 20170101                0
    4 334         0 2018-05-01 20180501                0
    5 334         1 2021-01-01 20210101                2
    6 334         1 2021-06-03 20210603                1
    7 334         0 2022-04-14 20220414                0
    8 334         1 2022-04-14 20220414                0
    9 334         0 2023-05-03 20230503                0