Search code examples
rmodelingrounding-errorprobability-distributiondemographics

R: Allocating N discrete elements between population cohorts by probability


I want to model demographic changes within a population, using historical data going from 1986 until the present. My data covers a population divided into age cohorts of a single year: 0,1,2 ... 106+.

Every age cohort, has a corresponding probability of dying - based on national statistics. Multiplying the number of men aged 50 with the probability of dying would yield

276 x 0.004813 = 1.328

Although the expectation is 1.328, the number of 50 year old men dying in 1986 has to be an integer, and would likely be somewhere close to the expectation. I could simply multiply and round these numbers up or down, but doing this across the whole population over many years would generate significant rounding errors.

Since I'm using hitorical data, I know the number of total deaths within the entire male population in 1986: 259. I also know the number of people in each cohort, and the corresponding estimated probability of dying.

Consider the following data from 1986:

age <- (0:106)
pop <- c(313,330,266,347,289,297,282,300,287,329,345,347,397,390,426,425,493,464,446,428,441,459,406,415,381,410,390,388,390,401,382,385,346,355,401,396,377,424,428,487,424,447,407,328,298,324,315,299,297,259,276,258,277,280,283,263,253,253,269,292,267,247,244,251,227,253,206,213,193,193,165,162,144,173,128,120,102,106,91,85,67,67,54,47,36,39,26,22,24,9,12,9,5,2,1,0,1,1,0,0,0,0,0,0,0,0,0)
mortality_p <- c(0.008395, 0.001153, 0.000464, 0.000229, 0.000304, 0.000342, 0.000263, 0.000337, 0.000414,
  0.000037, 0.000212, 0.000168, 0.000353, 0.000246, 0.000389, 0.000506, 0.000878, 0.000890, 0.001480, 0.001779, 0.001226, 0.001476, 0.001033, 0.001407, 0.000973, 0.001418, 0.001034, 0.000684, 0.000937, 0.000933, 0.001085, 0.001289, 0.001361, 0.000949, 0.001028, 0.001383, 0.001428, 0.001618, 0.001456, 0.001586, 0.001880, 0.001824, 0.002599, 0.002735, 0.002613, 0.003712, 0.003805, 0.005025, 0.004538, 0.004156, 0.004813, 0.004727, 0.006535, 0.007961, 0.007828, 0.007898, 0.008708, 0.010078, 0.013314, 0.013609, 0.015112, 0.016745, 0.016766, 0.020449, 0.022059, 0.022400, 0.026670, 0.027975, 0.032242, 0.032829, 0.038957, 0.039702, 0.046757, 0.051496, 0.056179, 0.062836, 0.063975, 0.071973, 0.076188, 0.090343, 0.089612, 0.100477, 0.104900, 0.118855, 0.135289, 0.141978, 0.157365, 0.161306, 0.169835, 0.198279, 0.197704, 0.228076, 0.230895, 0.249000, 0.289030, 0.337058, 0.282705, 0.335130, 0.322072,0.392166, 0.327260, 0.270787, 0.527633, 0.264859, 0.358820, 0.981684, 1.000000)

data <- data.frame(age,pop,mortality_p)

One way of simulating mortality would be to take the number of recorded deaths, 259, and randomly allocating these among the chohorts, based on their repsective probability of dying.

In order to be realistic, most of these deaths should occur among the elderly, but for a large population, unlikely events - like a 10-year old dying - should occationally happen. Also, no deaths can occur for age cohorts consisting of 0 people.

Using R, are there any suggestions for how to distribute 259 elements between 107 cohorts, based on an associated probability and number of people in each cohort?


Solution

  • The deaths follow a weighted urn model. I don't know of an R function to generate samples efficiently with large numbers, but with the moderately sized numbers in your example, it can be done with tabulate and sample:

    data$deaths <- with(
      data,
      tabulate(
        sample(
          rep.int(seq_along(age), pop), 259, 1,
          rep.int(mortality_p, pop)
        ),
        nrow(data)
      )
    )