r random dplyr simulation montecarlo

Monte Carlo Simulation of a date based on another date


I have a dataset like the one below. date_e is accurate for rows with status = 1. I want to simulate date_e based on age, so that new_date_e changes for status = 0 and stays the same for status = 1. Also, status = 1 carries a higher risk, so the difference date_e - age should on average be shorter for status = 1 than for status = 0.

           age      date_e  status  id
1   1950-10-21 2008-11-02      0   1
2   1941-02-11 2006-08-28      0   2
3   1940-01-20 2000-05-25      0   3
4   1957-11-05 2008-03-28      1   4
5   1946-09-15 2004-03-10      0   5

A reproducible example (here date_e is named date1):

library(dplyr)

set.seed(1)

age <- sample(seq(as.Date('1930-01-01'), as.Date('1970-01-01'), by="day"), 1000)
date1 <- sample(seq(as.Date('2000-01-01'), as.Date('2010-01-01'), by="day"), 1000)
status <- sample(c(0, 1), size = 1000, replace = TRUE, prob = c(0.8, 0.2))
df <- data.frame(age, date1, status)
df <- df %>% mutate(id = row_number())

Solution

  • I think what you want to simulate is the effect of status on longevity (i.e. the time difference between date1 and age in your reproducible example). At the moment, status has no effect on longevity:

    library(ggplot2)
    
    df %>%
      ggplot(aes(x    = factor(status),
                 # days between the two dates, converted to years
                 y    = as.numeric(difftime(date1, age, units = 'days')) / 365.25,
                 fill = factor(status))) +
      geom_boxplot(width = 0.6) +
      guides(fill = guide_none()) +
      labs(x = 'Status', y = 'Age (years)')
    


    Effectively, you need to subtract a random amount of time from the date1 column where status == 1. Two facts make this easy. First, dates in R are stored 'under the hood' as the number of days since 1970-01-01, so subtracting a number from a date shifts it back by that many days. Second, you can multiply the random draw by the status column, so rows with status == 0 always have 0 subtracted.
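To make the date arithmetic concrete, here is a minimal sketch using two dates taken from the example table. The names `d`, `shift` and `shifted` are made up for illustration, and the shift is a fixed 3650 days rather than a random draw.

```r
# Two dates from the example table, with their status values
d      <- as.Date(c("2008-11-02", "2008-03-28"))
status <- c(0, 1)
shift  <- c(3650, 3650)   # fixed shift in days, for illustration only

# A Date converts to a day count since 1970-01-01
as.numeric(as.Date("1970-01-10"))   # 9

# Multiplying by status zeroes the shift for the status == 0 row
shifted <- d - status * shift
shifted
```

The first element of `shifted` equals the original date, while the second is exactly 3650 days earlier.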

    So all you need is:

    df$date1 <- df$date1 - df$status * round(rnorm(nrow(df), 3650, 500))
    

    This removes about 10 years on average (3650 days, with a standard deviation of 500 days) from rows with status == 1 and leaves rows with status == 0 as-is:

    df %>% 
      ggplot(aes(x    = factor(status), 
                 y    = as.numeric(difftime(date1, age, units = 'days')) / 365.25,
                 fill = factor(status))) + 
      geom_boxplot(width = 0.6) +
      guides(fill = guide_none()) +
      labs(x = 'Status', y = 'Age (years)')
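If you would like a numeric check alongside the boxplot, something like the sketch below should work. It rebuilds the example data in base R and writes the shifted dates to a new column, `new_date1` (a hypothetical name, not from the question), so that date1 is preserved and rerunning the step does not subtract the shift twice.

```r
set.seed(1)
n      <- 1000
age    <- sample(seq(as.Date("1930-01-01"), as.Date("1970-01-01"), by = "day"), n)
date1  <- sample(seq(as.Date("2000-01-01"), as.Date("2010-01-01"), by = "day"), n)
status <- sample(c(0, 1), size = n, replace = TRUE, prob = c(0.8, 0.2))
df     <- data.frame(age, date1, status)

# Shift only the status == 1 rows; new_date1 is a hypothetical column name
df$new_date1 <- df$date1 - df$status * round(rnorm(nrow(df), 3650, 500))

# Date - Date gives a difference in days; convert to years
years <- as.numeric(df$new_date1 - df$age) / 365.25

# Mean simulated age per status group
means <- tapply(years, df$status, mean)
means
```

The mean for status 1 should come out roughly 10 years below the mean for status 0, and every status 0 row keeps its original date.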
    
