r, distribution, random, population

Generate population data with specific distribution in R


I have a distribution of ages in a population.

For instance, you can imagine something like this:

Ages <25: 15%

Ages 25-49: 40%

Ages 50-60: 20%

Ages >60: 25%

I don't have the mean and standard deviation for each stratum/age group in the data. I am trying to generate a sample population of 1000 individuals where the generated data matches the distribution of ages shown above.


Solution

  • Let's put this data in a more friendly format:

    (dat <- data.frame(min=c(0, 25, 50, 60), max=c(25, 50, 60, 100), prop=c(0.15, 0.40, 0.20, 0.25)))
    #   min max prop
    # 1   0  25 0.15
    # 2  25  50 0.40
    # 3  50  60 0.20
    # 4  60 100 0.25
    

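    We can draw 1000 row indices from the table with the sample function, sampling with replacement and weighting each row by its prop value. The resulting counts are random, so they are close to the target shares but not exactly equal to them: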

    set.seed(144)  # For reproducibility
    rows <- sample(nrow(dat), 1000, replace=TRUE, prob=dat$prop)
    table(rows)
    # rows
    #   1   2   3   4 
    # 139 425 198 238 
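
    If the bucket counts need to match the target shares exactly rather than approximately, one option is to fix the counts first and then shuffle. This is a minimal sketch, not part of the original answer; `counts` and `rows_exact` are names introduced here, and it assumes that the sample size times each proportion is a whole number.

    ```r
    # Same table as above, repeated so the block is self-contained.
    dat <- data.frame(min = c(0, 25, 50, 60),
                      max = c(25, 50, 60, 100),
                      prop = c(0.15, 0.40, 0.20, 0.25))
    n <- 1000

    # Deterministic bucket counts; assumes n * prop is (near) a whole number.
    counts <- round(n * dat$prop)

    # Repeat each row index by its count, then shuffle so buckets do not come in blocks.
    set.seed(144)
    rows_exact <- sample(rep(seq_len(nrow(dat)), times = counts))
    table(rows_exact)
    # rows_exact
    #   1   2   3   4 
    # 150 400 200 250 
    ```

    The trade-off is that the individuals are no longer independent draws: the bucket totals are fixed by construction.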
    

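    To sample actual ages you will need to define a distribution over the ages represented by each row. A simple choice is to draw uniformly within each range and round to whole years. Note that because of round(), a boundary age such as 25 can come from either adjacent bucket, and the extreme ages 0 and 100 are drawn about half as often as interior ages: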

    age <- round(dat$min[rows] + runif(1000) * (dat$max[rows] - dat$min[rows]))
    table(age)
    # age
    #   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27 
    #   2   5   5   3   7   7   9   6   7   6   1   7   7   5   5   6   2   4   6   7   4  11   8   2   3  10  11  13 
    #  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55 
    #  19  16  20  16  18  21  16  19  14  20  15  13  18  15  24  20  16  16  29  16  11  12  18  17  17  26  27  21 
    #  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83 
    #  17  26  11  13  20   3   8   9   6   4   3   3   5   4   3   3   5   8   3  13   5   6   4   7   9   9   6   4 
    #  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
    #   5   5   9   9   5   6   8   9   5   4   6   5   9   6   8   4   1 
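
    If every age should belong to exactly one bucket, a variant is to store inclusive integer bounds and use floor() instead of round(). The sketch below is an illustration under assumptions: the bounds 0 and 100 and the names `dat_int`, `lo`, `hi`, `age_int` and `bucket` are introduced here, and its random draws differ from the output shown above.

    ```r
    # Inclusive integer bounds; the lower bound 0 and upper bound 100 are assumptions.
    dat_int <- data.frame(lo = c(0, 25, 50, 61),
                          hi = c(24, 49, 60, 100),
                          prop = c(0.15, 0.40, 0.20, 0.25))

    set.seed(144)
    n <- 1000
    rows <- sample(nrow(dat_int), n, replace = TRUE, prob = dat_int$prop)

    # lo + floor(U * (hi - lo + 1)) with U in (0, 1) is uniform on lo, lo + 1, ..., hi.
    width <- dat_int$hi[rows] - dat_int$lo[rows] + 1
    age_int <- dat_int$lo[rows] + floor(runif(n) * width)

    # Each age now maps back to a single bucket, so the shares can be checked directly.
    bucket <- cut(age_int, breaks = c(-Inf, 24, 49, 60, Inf),
                  labels = c("<25", "25-49", "50-60", ">60"))
    prop.table(table(bucket))
    ```

    With this layout every integer age inside a bucket is equally likely, including the bucket endpoints.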
    

    Of course, if sampling ages uniformly within each range is inappropriate for your application, you would need to pick some other within-bucket distribution to turn buckets into ages.
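
    As one possible illustration of a non-uniform choice, the sketch below weights the integer ages inside each bucket with a user-supplied function. Everything here is hypothetical: `age_weight` (flat up to 60, then declining linearly towards 100), the inclusive bounds in `dat_int`, and the name `age_w` are assumptions for the example, not something the data in the question supports.

    ```r
    # Inclusive integer bounds; the lower bound 0 and upper bound 100 are assumptions.
    dat_int <- data.frame(lo = c(0, 25, 50, 61),
                          hi = c(24, 49, 60, 100),
                          prop = c(0.15, 0.40, 0.20, 0.25))

    # Hypothetical relative weight of each age: flat up to 60, then a linear decline.
    age_weight <- function(a) ifelse(a <= 60, 1, 101 - a)

    set.seed(144)
    n <- 1000
    rows <- sample(nrow(dat_int), n, replace = TRUE, prob = dat_int$prop)

    age_w <- integer(n)
    for (i in seq_len(nrow(dat_int))) {
      idx <- which(rows == i)
      ages_i <- seq(dat_int$lo[i], dat_int$hi[i])
      # Draw positions within ages_i, weighted by age_weight; sample.int normalises prob.
      pick <- sample.int(length(ages_i), length(idx), replace = TRUE,
                         prob = age_weight(ages_i))
      age_w[idx] <- ages_i[pick]
    }

    # Bucket shares are unchanged; only the spread of ages inside each bucket differs.
    table(rows)
    summary(age_w[rows == 4])
    ```

    Because the bucket is chosen first and the age second, the overall shares still follow prop regardless of which weight function is plugged in.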