SAS Proc IML simulating using empirical distribution

I am trying to simulate data using an empirical distribution. For example, say there are five outcomes with probabilities as shown in the vector below:

PROBABILITY_VECTOR = [0.1, 0.2, 0.3, 0.25, 0.15]

PROBABILITY_VECTOR is calculated from empirical data. For the first category, the average probability is 0.1, but there is considerable variance among the samples. Similarly, the last category averages 0.15 across all the samples, again with considerable variance. The middle categories, with probabilities 0.3 and 0.25, are fairly tight.

I use PROC IML with these statements:

CALL RANDSEED(12345);
CALL RANDGEN(SAMPLE, "TABLE", PROBABILITY_VECTOR);

When I do this, the average of all the simulated outcomes is consistent with the probability vector, as you would expect. But if I want my simulated trials to also show the wide variance that I observe in some of the categories in my data, how do I do that? Any ideas?

Solution

It sounds like you have k groups of subjects, and the sizes of the groups are N_1, N_2, ..., N_k. For each group, you have measured the proportion of subjects that have some characteristic of interest. The proportions are p_1, p_2, ..., p_k.

To simulate data like these, first take a random draw from a multinomial distribution that has N = N_1 + N_2 + ... + N_k subjects and membership probabilities N_1/N, N_2/N, ..., N_k/N. This gives you a new sample of N subjects spread across k groups, in which each group has approximately the same number of subjects as in the data. The differing group sizes are why some groups have "wide variance" whereas others are "tight."
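The multinomial step can be sketched in SAS/IML as follows. This is only an illustration: the group sizes in N_obs are hypothetical values chosen for the example (the question does not state them), and the variable names are mine.

```sas
proc iml;
call randseed(12345);

/* Hypothetical observed group sizes N_1, ..., N_k (not from the question) */
N_obs = {20 50 200 150 30};
N     = sum(N_obs);              /* total number of subjects */
prob  = N_obs / N;               /* membership probabilities N_i / N */

/* One multinomial draw: a 1 x k row of simulated group sizes that sum to N */
grpSize = RandMultinomial(1, N, prob);
print N_obs, grpSize;
quit;
```

Each simulated group size should land close to the corresponding observed size, with some random variation from draw to draw.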

To simulate which subjects in a group have the characteristic, use the binomial(p_i, N_i) distribution, where p_i is the proportion and N_i is the size of the i-th group. This randomly assigns the characteristic to some of the subjects in the i-th group.
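A minimal sketch of the binomial step, assuming the group sizes are already known. The sizes in grpSize are hypothetical, and p reuses the vector from the question purely for illustration. RANDGEN fills a matrix that must be allocated beforehand, so the loop draws one value per group into a 1 x 1 matrix.

```sas
proc iml;
call randseed(12345);

/* Hypothetical group sizes; proportions reuse the vector from the question */
grpSize = {20 50 200 150 30};
p       = {0.1 0.2 0.3 0.25 0.15};

k         = ncol(p);
numEvents = j(1, k, .);          /* subjects with the characteristic, per group */
c         = j(1, 1, .);          /* RANDGEN fills a preallocated matrix */
do i = 1 to k;
   call randgen(c, "Binomial", p[i], grpSize[i]);
   numEvents[i] = c;
end;

simProp = numEvents / grpSize;   /* simulated proportion in each group */
print grpSize, numEvents, simProp;
quit;
```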

If you repeat this process over and over, you will see that the smaller groups have more variation than the larger groups. I have written a detailed explanation, including a SAS/IML program and graphics that visualize the variation among the groups. See the article, "Simulate proportions for groups."
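For readers who want to try the repetition directly, here is a rough sketch that runs both steps many times and summarizes the simulated proportions. It is not the program from the article: the group sizes and the number of repetitions are hypothetical, and p reuses the vector from the question.

```sas
proc iml;
call randseed(12345);

/* Hypothetical group sizes; proportions reuse the vector from the question */
N_obs = {20 50 200 150 30};
p     = {0.1 0.2 0.3 0.25 0.15};
k     = ncol(p);
N     = sum(N_obs);
nRep  = 1000;                                 /* number of simulated samples */

/* Step 1: nRep multinomial draws of the group sizes (nRep x k matrix) */
grpSize = RandMultinomial(nRep, N, N_obs / N);

/* Step 2: binomial draw for every group in every simulated sample */
simProp = j(nRep, k, .);
c = j(1, 1, .);
do r = 1 to nRep;
   do i = 1 to k;
      if grpSize[r, i] > 0 then do;           /* leave empty groups as missing */
         call randgen(c, "Binomial", p[i], grpSize[r, i]);
         simProp[r, i] = c / grpSize[r, i];
      end;
   end;
end;

/* Column-wise mean and standard deviation of the simulated proportions */
avgProp = mean(simProp);
sdProp  = std(simProp);
print N_obs, avgProp, sdProp;
quit;
```

With these sizes, the columns for the groups of 20 and 30 subjects should show a noticeably larger standard deviation than the columns for the groups of 200 and 150, while the column means stay close to p.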