I have a dataset like this. The date_e is accurate for status = "1". I want to simulate date_e based on age. Therefore, new_date_e will change for status = "0" and stay the same for status = "1". Also, status = 1 has higher risk, so the difference date_e - age should be shorter on average for status = "1" than for "0".
         age     date_e status id
1 1950-10-21 2008-11-02      0  1
2 1941-02-11 2006-08-28      0  2
3 1940-01-20 2000-05-25      0  3
4 1957-11-05 2008-03-28      1  4
5 1946-09-15 2004-03-10      0  5
And the data is:
library(dplyr)
set.seed(1)
age <- sample(seq(as.Date('1930-01-01'), as.Date('1970-01-01'), by="day"), 1000)
date1 <- sample(seq(as.Date('2000-01-01'), as.Date('2010-01-01'), by="day"), 1000)
status <- sample(c(0, 1), size = 1000, replace = TRUE, prob = c(0.8, 0.2))
df <- data.frame(age, date1, status)
df <- df %>% mutate(id = row_number())
I guess what you are wanting to simulate is the effect of status on longevity (i.e. the time difference between date1 and age in your reproducible example). At the moment, status has no effect on longevity:
library(ggplot2)

# Boxplot of (approximate) age in years at date1, split by status
df %>%
  ggplot(aes(x = factor(status),
             y = as.numeric(difftime(date1, age, units = 'weeks'))/52,
             fill = factor(status))) +
  geom_boxplot(width = 0.6) +
  guides(fill = guide_none()) +
  labs(x = 'Status', y = 'Age (years)')
Effectively, what you need to do is to subtract a random amount of time from the date1 column where status == 1. To do this, you can take advantage of the fact that dates are stored as plain numbers (days since 1970-01-01) 'under the hood' in R, and the fact that you can multiply a random draw by the status column, since those with status == 0 will thereby always have 0 subtracted.
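As a quick illustration of those two facts, here is a minimal sketch. The vectors d and s are made-up names for this example only and are not part of your data:

```r
# Hypothetical toy vectors, for illustration only
d <- as.Date(c("2005-01-01", "2005-01-01"))
s <- c(0, 1)

# A Date is held as the number of days since 1970-01-01
unclass(as.Date("1970-01-11"))   # 10

# Multiplying the shift by s zeroes it out wherever s == 0,
# so only the second date moves (2004 is a leap year, hence Jan 2)
shifted <- d - s * 365
shifted   # "2005-01-01" "2004-01-02"
```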
So all you need to do is:
df$date1 <- df$date1 - df$status * round(rnorm(nrow(df), 3650, 500))
This removes about 10 years on average (a normal draw with mean 3650 days and standard deviation 500 days) from those with status == 1 but leaves those with status == 0 as-is:
# Same boxplot as before, now on the shifted dates
df %>%
  ggplot(aes(x = factor(status),
             y = as.numeric(difftime(date1, age, units = 'weeks'))/52,
             fill = factor(status))) +
  geom_boxplot(width = 0.6) +
  guides(fill = guide_none()) +
  labs(x = 'Status', y = 'Age (years)')
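Your question says the dates for status == 1 are the accurate ones and should stay untouched. If that matters, the same multiplication trick can probably be flipped. The sketch below is a variation under that assumption, not what the line above does: it rebuilds your example data from scratch, writes the result to a new column (new_date_e, the name from your question), and pushes the status == 0 dates later instead. The 3650 and 500 are the same illustrative values as before, and the group means are a numeric check to go alongside the boxplot.

```r
# Rebuild the example data from the question
set.seed(1)
age    <- sample(seq(as.Date('1930-01-01'), as.Date('1970-01-01'), by = "day"), 1000)
date1  <- sample(seq(as.Date('2000-01-01'), as.Date('2010-01-01'), by = "day"), 1000)
status <- sample(c(0, 1), size = 1000, replace = TRUE, prob = c(0.8, 0.2))
df <- data.frame(age, date1, status)

# Assumption: status == 1 keeps its observed date; status == 0 is moved
# later by a random number of days (mean 3650, sd 500).
# (1 - status) is 1 for status == 0 and 0 for status == 1.
df$new_date_e <- df$date1 + (1 - df$status) * round(rnorm(nrow(df), 3650, 500))

# Mean time between age and new_date_e, in years, for each status group
years <- as.numeric(difftime(df$new_date_e, df$age, units = "days")) / 365.25
mean_years <- tapply(years, df$status, mean)
mean_years
```

Either way, the gap between the groups is set entirely by the mean of the random draw, so change 3650 to whatever difference you want to simulate.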