r, variables, random, statistics, simulation

Generating dataframe with conditional random variables


I'm trying to create a dataframe with two variables. The first is binary physical activity status (active/non-active); the second is the presence of cardiovascular disease (also binary, 1/0).

The problem is that the disease must occur with a given probability that depends on activity status: 0.5 for people who are physically active and 0.6 for people who are not.

I really need help with this, as I don't have any idea how to do it. I have looked into the sample and rbinom functions, but to no avail.


Solution

  • You can do this in base R with an ifelse statement and the prob argument in sample. Below is an example with 10,000 observations (large enough that the row proportions from prop.table land close to the target probabilities).

    set.seed(123)
    n <- 1e4
    df <- data.frame(physical_activity = sample(c("Active", "Non-Active"), n, replace = TRUE))
    
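    # Both sample() calls draw n values; ifelse() keeps, row by row,
    # the draw that matches that row's activity status
    df$cardio_disease <- ifelse(df$physical_activity == "Active", 
                                sample(0:1, n, replace = TRUE), # P(disease | active) = 0.5
                                sample(0:1, n, replace = TRUE, prob = c(0.4, 0.6))) # P(disease | non-active) = 0.6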
    

    Check results:

    prop.table(table(df), margin = 1)
    
    #physical_activity         0         1
    #       Active     0.5062787 0.4937213
    #       Non-Active 0.4021674 0.5978326
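
Since the question mentions rbinom, here is a minimal sketch of the same simulation using it instead of sample. It relies on rbinom being vectorised over its prob argument, so each row can get its own disease probability. The names p_disease and df2 are made up for this illustration, and the 50/50 split between active and non-active people is an assumption carried over from the example above.

```r
set.seed(123)
n <- 1e4

# Activity status, drawn with equal probability (assumption, as above)
physical_activity <- sample(c("Active", "Non-Active"), n, replace = TRUE)

# Per-row disease probability: 0.5 if active, 0.6 otherwise
p_disease <- ifelse(physical_activity == "Active", 0.5, 0.6)

# One Bernoulli draw per row (size = 1), each with its own probability
cardio_disease <- rbinom(n, size = 1, prob = p_disease)

df2 <- data.frame(physical_activity, cardio_disease)

# Row proportions should land close to 0.5 and 0.6 in the "1" column
prop.table(table(df2), margin = 1)
```

This version draws only one value per row and keeps the probabilities in a single place, which may be easier to extend if more activity levels are added later.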