I'm trying to create a data frame with two variables. The first is binary physical activity status (active/non-active), and the second is the presence of cardiovascular disease (also a binary variable, 1/0).
The problem is that the disease must occur with a given probability: 0.5 for people with active status and 0.6 for people who are not physically active.
I really need help, as I have no idea how to do this. I have looked into the sample and rbinom functions, but to no avail.
You can do this in base R with an ifelse statement and the prob argument of sample. Below is an example with 10,000 observations (a large sample makes it easier to check with prop.table that the proportions come out as intended).
set.seed(123)
n <- 1e4
df <- data.frame(physical_activity = sample(c("Active", "Non-Active"), n, replace = TRUE))
df$cardio_disease <- ifelse(df$physical_activity %in% "Active",
                            sample(0:1, n, replace = TRUE),                      # P(disease | active) = 0.5
                            sample(0:1, n, replace = TRUE, prob = c(0.4, 0.6)))  # P(disease | non-active) = 0.6
Check results:
prop.table(table(df), margin = 1)
#physical_activity         0         1
#       Active     0.5062787 0.4937213
#       Non-Active 0.4021674 0.5978326
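Since the question mentions rbinom, here is a sketch of the same simulation using it instead of sample. It relies on rbinom accepting a vector for prob, so each row gets its own disease probability. The names p_disease and df2 are made up for this illustration, and the 50/50 split between activity groups is an assumption carried over from the example above.

```r
set.seed(123)
n <- 1e4

# Activity status, drawn with equal probability (assumption, as in the example above)
physical_activity <- sample(c("Active", "Non-Active"), n, replace = TRUE)

# Per-row disease probability: 0.5 if active, 0.6 otherwise (hypothetical helper vector)
p_disease <- ifelse(physical_activity == "Active", 0.5, 0.6)

# One Bernoulli draw per row; rbinom uses the matching element of prob for each draw
cardio_disease <- rbinom(n, size = 1, prob = p_disease)

df2 <- data.frame(physical_activity, cardio_disease)

# Observed share of disease within each activity group
tapply(df2$cardio_disease, df2$physical_activity, mean)
```

The group means should land close to 0.5 and 0.6, with some sampling noise. This version draws only one random number per row, whereas the ifelse approach draws two full vectors and discards half of the values.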