I have a dataframe with columns 'Person', 'NoShow', and 'date'.
Code to create the dataframe:
df <- data.frame(Person = c('334','334','334','334','334','334','334','334','334'),
NoShow = c(1,1,1,0,1,1,0,1,0),
date = c('2014-01-01','2016-06-01','2017-01-01','2018-05-01','2021-01-01','2021-06-03','2022-04-14','2022-04-14','2023-05-03'),
date_num = c(20140101, 20160601, 20170101, 20180501, 20210101, 20210603,20220414,20220414,20230503)
)
I want to create two new variables: One variable to show whether or not there was a no-show within two years of the current date, and one variable that counts the number of no-shows within two years of the current date.
The code below successfully creates the first variable 'reNoShow_2y' (whether or not there was a no-show within two years of the current date):
df$date_num_rev <- df$date_num * (-1)
df <- df[order(df$date_num_rev),]
### create reNoShow (0,1) and reNoShow_f (# of re NoShows)
library(tidyverse)
df <- df %>%
group_by(Person) %>%
mutate(reNoShow = lag(as.numeric(cumsum(NoShow) > 0)),
reNoShow_f = lag(cumsum(NoShow), default = 0)*(date != lag(date)),
reNoShow_f = case_when(
is.na(reNoShow_f) ~ 0,
reNoShow_f == 0 ~ lag(reNoShow_f),
TRUE ~ reNoShow_f
))
### to prepare for the next step, convert reNoShow_f==NA to reNoShow==0
df$reNoShow_f[is.na(df$reNoShow_f)] <- 0
### create NoShowiable that captures the date of the most recent NoShow
df <- df %>%
group_by(Person) %>%
mutate(nextNoShow=ifelse(c(FALSE, diff(reNoShow_f))!=0, lag(date),NA)) %>%
fill(nextNoShow)
df$daysto_nextNoShow <- as.numeric(difftime(df$nextNoShow, df$date, units = c('days')))
df$reNoShow_2y <- ifelse(df$reNoShow==1 & df$daysto_nextNoShow >= 730 & df$date!=df$nextNoShow,0,1)
View(df)
I don't know how to create the second variable (the number of no-shows within two years of the current date). From rows 1-9, the values of that variable should be: 0,0,0,1,2,0,0,1,0
This should calculate the number of no-shows (summing the NoShow
column) for the period from the next day to two years' hence.
library(lubridate); library(dplyr); library(slider)
df %>%
mutate(date = ymd(date)) %>%
group_by(Person) %>%
arrange(Person, date) %>% # slider unfortunately requires dates be sorted
mutate(no_shows_next2yr = slide_index_dbl(
NoShow, date, sum, .before = days(-1), .after = years(2)
))
Person NoShow date date_num no_shows_next2yr
<chr> <dbl> <date> <dbl> <dbl>
1 334 1 2014-01-01 20140101 0
2 334 1 2016-06-01 20160601 1
3 334 1 2017-01-01 20170101 0
4 334 0 2018-05-01 20180501 0
5 334 1 2021-01-01 20210101 2
6 334 1 2021-06-03 20210603 1
7 334 0 2022-04-14 20220414 0
8 334 1 2022-04-14 20220414 0
9 334 0 2023-05-03 20230503 0