Is there a way to create a fake dataset that fits the following parameters: N, mean, sd, min, and max?
I want to create a sample of 187 integer scale scores that have a mean of 67 and a standard deviation of 17, with observations within the range [30, 210]. I'm trying to demonstrate a conceptual lesson about statistical power, and I would like to create data with a distribution that looks like a published result. The scale score in this example is the sum of 30 items that each could range from 1 to 7. I don't need data for the individual items that make up the scale score, but that would be a bonus.
I know I could use rnorm(), but the values are not integers, and the min and max can exceed my possible values.
scaleScore <- rnorm(187, mean = 67, sd = 17)
I also know I could use sample() to get integers that stay within this range, but the mean and standard deviation won't be right.
scaleScore <- sample(30:210, 187, replace=TRUE)
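Even combining the two ideas (rounding the rnorm() draws and clamping them into range with pmax()/pmin()) keeps the values legal but still only approximates the target moments:

```r
set.seed(5)
# Round to integers, then clamp into the allowed range [30, 210]
x <- pmin(pmax(round(rnorm(187, mean = 67, sd = 17)), 30), 210)
all(x == round(x))  # TRUE: integers only
range(x)            # stays within [30, 210]
mean(x)             # close to 67, but not exact
sd(x)               # close to 17, but not exact
```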
@Pascal's tip led me to urnorm() in the Runuran package:
set.seed(5)
scaleScore <- urnorm(n=187, mean=67, sd=17, lb=30, ub=210)
mean(scaleScore)
# [1] 68.51758
sd(scaleScore)
# [1] 16.38056
min(scaleScore)
# [1] 32.15726
max(scaleScore)
# [1] 107.6758
Mean and SD are not exact, of course, and the vector does not consist of integers.
Any other options?
Since you want an exact mean, standard deviation, min, and max, my first choice wouldn't be random number generation: your sample is unlikely to exactly match the mean and standard deviation of the distribution you're drawing from. Instead, I would take an integer optimization approach. Define variable x_i to be the number of times integer i appears in your sample. That gives decision variables x_30, x_31, ..., x_210, plus constraints that ensure all your conditions are met:
x_30 + x_31 + ... + x_210 = 187
30*x_30 + 31*x_31 + ... + 210*x_210 = 187 * 67
To keep the resulting distribution reasonably smooth, we can also bound the difference between consecutive frequencies: x_30 - x_31 <= 1, x_30 - x_31 >= -1, and so on for every consecutive pair. We can further require that each frequency not exceed some arbitrarily defined upper bound (I'll use 10). Finally, we want the standard deviation to be as close to 17 as possible, meaning we want the variance to be as close as possible to 17^2 = 289. We can define a variable y to be an upper bound on the absolute deviation from this target, and minimize y:
y >= ((30-67)^2 * x_30 + (31-67)^2 * x_31 + ... + (210-67)^2 * x_210) - (289 * (187-1))
y >= -((30-67)^2 * x_30 + (31-67)^2 * x_31 + ... + (210-67)^2 * x_210) + (289 * (187-1))
This is a pretty easy optimization problem to solve with a solver like lpSolve
:
library(lpSolve)
get.sample <- function(n, avg, stdev, lb, ub) {
  vals <- lb:ub
  nv <- length(vals)
  # Decision variables: one frequency per integer value in [lb, ub], plus y last
  mod <- lp(direction = "min",
            objective.in = c(rep(0, nv), 1),  # minimize y
            const.mat = rbind(c(rep(1, nv), 0),      # frequencies sum to n
                              c(vals, 0),             # values sum to n * avg
                              c(-(vals-avg)^2, 1),    # y >= (squared-dev sum) - target
                              c((vals-avg)^2, 1),     # y >= target - (squared-dev sum)
                              cbind(diag(nv), rep(0, nv)),  # each frequency <= 10
                              cbind(diag(nv)-cbind(rep(0, nv), diag(nv)[,-nv]), rep(0, nv)),   # x_i - x_{i-1} <= 1
                              cbind(diag(nv)-cbind(rep(0, nv), diag(nv)[,-nv]), rep(0, nv))),  # x_i - x_{i-1} >= -1
            const.dir = c("=", "=", ">=", ">=", rep("<=", nv), rep("<=", nv), rep(">=", nv)),
            const.rhs = c(n, avg*n, -stdev^2 * (n-1), stdev^2 * (n-1),
                          rep(10, nv), rep(1, nv), rep(-1, nv)),
            all.int = TRUE)
  rep(vals, head(mod$solution, -1))  # expand frequencies into a sample; drop y
}
samp <- get.sample(187, 67, 17, 30, 210)
summary(samp)
#  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#    30      64      69      67      74     119
sd(samp)
# [1] 17
plot(table(samp))
For the parameters you provided, we were able to get the exact mean and standard deviation while returning all integer values, and the computation completed on my computer in 0.4 seconds.
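The exactness isn't luck: the equality constraints pin the total to 187 * 67, and driving y to 0 pins the squared-deviation sum to 17^2 * 186, so the sample mean and standard deviation are forced to the targets. A quick arithmetic check of that claim:

```r
n <- 187; avg <- 67; stdev <- 17
total <- n * avg            # sum of scores forced by the first equality constraint
ss    <- stdev^2 * (n - 1)  # squared-deviation sum forced when y = 0
total / n                   # mean: exactly 67
sqrt(ss / (n - 1))          # sd: exactly 17
```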
Another approach to getting something that resembles "real data" would be to start from a continuous distribution (e.g. the result of the urnorm function in the original post) and round its values to integers in the way that best achieves your mean and standard deviation objectives. This only introduces two new classes of constraints: the number of samples at an integer value is at most the number of continuous samples that could be rounded up or down to that value, and the sum of two consecutive frequencies is at least the number of continuous samples that fall between those two integers. Again, this is easy to implement with lpSolve and efficient to run:
library(lpSolve)
get.sample2 <- function(n, avg, stdev, lb, ub, init.dist) {
  vals <- lb:ub
  nv <- length(vals)
  # Upper bound per value: continuous samples that round (up or down) to it
  lims <- as.vector(table(factor(c(floor(init.dist), ceiling(init.dist)), vals)))
  # Lower bound per consecutive pair: continuous samples between those integers
  floors <- as.vector(table(factor(floor(init.dist), vals)))
  mod <- lp(direction = "min",
            objective.in = c(rep(0, nv), 1),  # minimize y
            const.mat = rbind(c(rep(1, nv), 0),     # frequencies sum to n
                              c(vals, 0),            # values sum to n * avg
                              c(-(vals-avg)^2, 1),   # y bounds the variance gap
                              c((vals-avg)^2, 1),
                              cbind(diag(nv), rep(0, nv)),  # x_i <= lims_i
                              cbind(diag(nv) + cbind(rep(0, nv), diag(nv)[,-nv]), rep(0, nv))),  # x_i + x_{i-1} >= floors_i
            const.dir = c("=", "=", ">=", ">=", rep("<=", nv), rep(">=", nv)),
            const.rhs = c(n, avg*n, -stdev^2 * (n-1), stdev^2 * (n-1), lims, floors),
            all.int = TRUE)
  rep(vals, head(mod$solution, -1))
}
library(Runuran)
set.seed(5)
init.dist <- urnorm(n = 187, mean = 67, sd = 17, lb = 30, ub = 210)
samp2 <- get.sample2(187, 67, 17, 30, 210, init.dist)
summary(samp2)
#  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#    32      57      66      67      77     107
sd(samp2)
# [1] 17
plot(table(samp2))
This approach is even faster (under 0.1 seconds) and still returns a distribution that exactly meets the required mean and standard deviation. Further, given sufficiently high-quality samples from continuous distributions, it can produce integer-valued distributions of different shapes that meet the required statistical properties.
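The question also mentioned that item-level data (30 items, each in 1..7, summing to the scale score) would be a bonus. One sketch using only base R (split.items is a made-up helper, not part of either package above): start every item at the minimum and hand out the remaining points one at a time to randomly chosen items that still have headroom:

```r
# Hypothetical helper: split a total score t into k items, each in [lo, hi]
split.items <- function(t, k = 30, lo = 1, hi = 7) {
  stopifnot(t >= k * lo, t <= k * hi)  # t must be achievable
  items <- rep(lo, k)
  for (p in seq_len(t - k * lo)) {
    open <- which(items < hi)               # items with headroom
    j <- open[sample.int(length(open), 1)]  # avoids sample()'s length-1 gotcha
    items[j] <- items[j] + 1
  }
  items
}
set.seed(5)
items <- split.items(67)
sum(items)    # 67
range(items)  # within [1, 7]
```

Applying this across a generated vector (e.g. sapply(samp, split.items)) would give a 30 x 187 matrix of item responses, one column per respondent.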