Simulating a representative dataset in R


Suppose I have the following data frame:

sectoral_data <- data.frame(sector=c("a","b","c","d"),share=c(0.5,0.3,0.1,0.1),avg_wage=c(400,600,800,1000))

where "share" is the employment share in each sector. I want to simulate (I guess that's the right word) the following data frame, which would represent a sample of ten individuals from that economy:

personal_data <- data.frame(individual=c(1:10),
                          wage=c(rep.int(400,5),rep.int(600,3),rep.int(800,1), rep.int(1000,1)),
                          sector=c(rep("a",5),rep("b",3), rep("c",1), rep("d",1))
                          )

Is there an efficient way to do this, or a built-in function for it?


Solution

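  • You can use sample. Draw the row indices once, so that each individual's wage and sector come from the same row of sectoral_data (sampling the two columns separately would pair wages with the wrong sectors):

    n <- 10

    # sample rows of sectoral_data with probability equal to the employment share
    idx <- sample(nrow(sectoral_data), size = n, replace = TRUE,
                  prob = sectoral_data$share)

    data.frame(
      individual = seq_len(n),
      wage = sectoral_data$avg_wage[idx],
      sector = sectoral_data$sector[idx]
    )
    # The draw is random, so the sector counts vary from run to run;
    # call set.seed() first if you need a reproducible result.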
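
If the sample should reproduce the shares exactly rather than at random (as in the personal_data example from the question), one possible approach is to repeat each row with rep. This is a minimal sketch, assuming that share * n comes out to (nearly) whole numbers; the names counts and idx are only illustrative:

```r
sectoral_data <- data.frame(sector = c("a", "b", "c", "d"),
                            share = c(0.5, 0.3, 0.1, 0.1),
                            avg_wage = c(400, 600, 800, 1000))

n <- 10

# number of individuals per sector (assumes share * n is close to an integer)
counts <- round(sectoral_data$share * n)

# repeat each row index as many times as its sector's count
idx <- rep(seq_len(nrow(sectoral_data)), times = counts)

personal_data <- data.frame(
  individual = seq_along(idx),
  wage = sectoral_data$avg_wage[idx],
  sector = sectoral_data$sector[idx]
)
```

For shares that do not divide n evenly, the rounded counts may not sum to n, so some adjustment rule would be needed in that case.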