How to easily generate/simulate example data with different groups for modelling


How can I easily generate (simulate) meaningful example data for modelling? For example, I would like to say: give me n rows of data for 2 groups, whose sex distributions and mean ages differ by X and Y units, respectively. Is there a simple way to do this automatically? Are there any packages for it?

For example, what would be the simplest way to generate data like this?

  • groups: two groups: A, B
  • sex: different sex distributions: A 30%, B 70%
  • age: different mean ages: A 50, B 70

PS! Tidyverse solutions are especially welcome.

My best try so far is still quite a lot of code:

library(dplyr) # provides bind_rows(), tibble() and %>%

n <- 100
d <- bind_rows(
  # group A females
  tibble(group = "A",
         sex = "Female",
         age = rnorm(n * 0.4, 50, 4)),
  # group B females
  tibble(group = "B",
         sex = "Female",
         age = rnorm(n * 0.3, 45, 4)),
  # group A males
  tibble(group = "A",
         sex = "Male",
         age = rnorm(n * 0.2, 60, 6)),
  # group B males
  tibble(group = "B",
         sex = "Male",
         age = rnorm(n * 0.1, 55, 4)))


d %>% group_by(group, sex) %>% 
  summarise(n = n(),
            mean_age = mean(age))



Solution

  • There are lots of ways to sample from vectors and to draw from random distributions in R. For example, the data set you requested could be created like this:

    set.seed(69) # Makes samples reproducible
    
    df <- data.frame(groups = rep(c("A", "B"), each = 100),
                     sex = c(sample(c("M", "F"), 100, TRUE, prob = c(0.3, 0.7)),
                             sample(c("M", "F"), 100, TRUE, prob = c(0.5, 0.5))),
                     age = c(runif(100, 25, 75), runif(100, 50, 90)))
    

    And we can use the tidyverse to show it does what was expected:

    library(dplyr)
    
    df %>% 
      group_by(groups) %>% 
      summarise(age = mean(age),
                # count of males; equals the percentage only because each group has 100 rows
                percent_male = sum(sex == "M"))
    #> # A tibble: 2 x 3
    #>   groups   age percent_male
    #>   <chr>  <dbl>        <int>
    #> 1 A       49.4           29
    #> 2 B       71.0           50
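
    If you want to state the group sizes, sex shares and age distributions in one place, a small specification table can drive the simulation. The following is only a sketch, not an established recipe: the column names (`n`, `p_female`, `mean_age`, `sd_age`) are made up for illustration, the 30%/70% figures are assumed to mean the share of females, ages are assumed to be normally distributed, and the standard deviation of 5 is an arbitrary choice.

```r
library(dplyr)
library(tidyr)

set.seed(1) # makes the draws reproducible

# One row per group: size, share of females, and age distribution.
# All column names and the sd of 5 are illustrative assumptions.
spec <- tibble(
  group    = c("A", "B"),
  n        = c(100, 100),
  p_female = c(0.3, 0.7),
  mean_age = c(50, 70),
  sd_age   = c(5, 5)
)

sim <- spec %>%
  uncount(n) %>% # repeat each spec row n times (drops the n column)
  mutate(
    # each row is female with its own group's probability
    sex = if_else(runif(n()) < p_female, "Female", "Male"),
    # rnorm() is vectorised over mean and sd, so each row uses its group's values
    age = rnorm(n(), mean = mean_age, sd = sd_age)
  ) %>%
  select(group, sex, age)

# Check the simulated data against the specification
sim %>%
  group_by(group) %>%
  summarise(n = n(),
            pct_female = 100 * mean(sex == "Female"),
            mean_age = mean(age))
```

    Adding a group or changing a target then only means editing `spec`. Because the values are random draws, the observed shares and means will be close to the targets rather than exactly equal to them.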