
Create a fake dataset that fits the following parameters: N, mean, sd, min, and max


Is there a way to create a fake dataset that fits the following parameters: N, mean, sd, min, and max?

I want to create a sample of 187 integer scale scores that have a mean of 67 and a standard deviation of 17, with observations within the range [30, 210]. I'm trying to demonstrate a conceptual lesson about statistical power, and I would like to create data with a distribution that looks like a published result. The scale score in this example is the sum of 30 items that each could range from 1 to 7. I don't need data for the individual items that make up the scale score, but that would be a bonus.

I know I could use rnorm(), but the values are not integers, and the minimum and maximum can fall outside my possible range.

scaleScore <- rnorm(187, mean = 67, sd = 17)

I also know I could use sample() to get integers that stay within this range, but the mean and standard deviation won't be right.

scaleScore <- sample(30:210, 187, replace=TRUE)

@Pascal's tip led me to urnorm() in the Runuran package:

set.seed(5)
scaleScore <- urnorm(n=187, mean=67, sd=17, lb=30, ub=210)
mean(scaleScore)
# [1] 68.51758
sd(scaleScore)
# [1] 16.38056
min(scaleScore)
# [1] 32.15726
max(scaleScore)
# [1] 107.6758

Mean and SD are not exact, of course, and the vector does not consist of integers.
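
A naive fix is to round the draws to integers, which keeps values in range but lets the mean and SD drift even further from the targets. As a minimal base-R sketch of the same truncated-normal idea without Runuran (rejection sampling is an illustrative stand-in here, not a recommendation):

```r
# Base-R sketch: rejection-sample a truncated normal and round.
# Values are integers in [30, 210], but the mean and SD still
# only approximate the targets of 67 and 17.
set.seed(5)
draw <- function(n, mean, sd, lb, ub) {
  out <- numeric(0)
  while (length(out) < n) {
    x <- rnorm(n, mean, sd)
    out <- c(out, x[x >= lb & x <= ub])  # keep in-range draws only
  }
  round(out[1:n])
}
scaleScoreInt <- draw(187, 67, 17, 30, 210)
range(scaleScoreInt)  # stays within [30, 210]
```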

Any other options?


Solution

  • Integer Optimization With No Template

    Since you want to have an exact mean, standard deviation, min, and max, my first choice wouldn't be random number generation, since your sample is unlikely to exactly match the mean and standard deviation of the distribution you're drawing from. Instead, I would take an integer optimization approach. You could define variable x_i to be the number of times integer i appears in your sample. You'll define decision variables x_30, x_31, ..., x_210 and add constraints that ensure all your conditions are met:

    • 187 samples: This can be encoded by the constraint x_30 + x_31 + ... + x_210 = 187
    • Mean of 67: This can be encoded by the constraint 30*x_30 + 31*x_31 + ... + 210*x_210 = 187 * 67
    • Logical constraints on variables: Variables must take non-negative integer values
    • "Looks like real data": This is obviously an ill-defined concept, but we could require that the frequencies of adjacent numbers differ by no more than 1. These are linear constraints of the form x_30 - x_31 <= 1, x_30 - x_31 >= -1, and so on for every consecutive pair. We can also require that each frequency not exceed some arbitrarily chosen upper bound (I'll use 10).
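
    As a toy illustration (4 support values, purely for exposition), the consecutive-difference rows can be built in base R from an identity matrix and a column-shifted copy:

```r
# Toy adjacency-difference constraint matrix: row i has +1 for
# x_i and -1 for x_(i+1), so pairing each row with <= 1 and
# >= -1 bounds every consecutive-frequency difference by 1.
nv <- 4
A <- diag(nv) - cbind(rep(0, nv), diag(nv)[, -nv])
A[1, 1:2]  # c(1, -1)
```

    Note that the last row degenerates to x_nv by itself, which is harmless here since frequencies near the top of the range are essentially zero.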

    Finally, we want the standard deviation to be as close to 17 as possible, meaning we want the variance to be as close as possible to 17^2 = 289, i.e. the sum of squared deviations from the mean to be as close as possible to 289 * (187-1). We can define a variable y to be an upper bound on the absolute deviation from that target, and minimize y:

    y >= ((30-67)^2 * x_30 + (31-67)^2 * x_31 + ... + (210-67)^2 * x_210) - (289 * (187-1))
    y >= -((30-67)^2 * x_30 + (31-67)^2 * x_31 + ... + (210-67)^2 * x_210) + (289 * (187-1))
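
    The right-hand side 289 * (187-1) comes from the identity sum((x - mean(x))^2) = (n-1) * var(x); since the mean constraint pins the sample mean to exactly 67, hitting this sum exactly forces the sample SD to be exactly 17. A quick base-R check of the identity on a tiny example:

```r
# Sum of squared deviations equals (n - 1) * var(x), so
# targeting 17^2 * (n - 1) pins the sample SD at exactly 17.
x <- c(50, 67, 84)
sum((x - mean(x))^2) == (length(x) - 1) * var(x)  # TRUE
```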
    

    This is a pretty easy optimization problem to solve with a solver like lpSolve:

    library(lpSolve)
    get.sample <- function(n, avg, stdev, lb, ub) {
      vals <- lb:ub
      nv <- length(vals)
      mod <- lp(direction = "min",
                objective.in = c(rep(0, nv), 1),
                const.mat = rbind(c(rep(1, nv), 0),
                                  c(vals, 0),
                                  c(-(vals-avg)^2, 1),
                                  c((vals-avg)^2, 1),
                                  cbind(diag(nv), rep(0, nv)),
                                  cbind(diag(nv)-cbind(rep(0, nv), diag(nv)[,-nv]), rep(0, nv)),
                                  cbind(diag(nv)-cbind(rep(0, nv), diag(nv)[,-nv]), rep(0, nv))),
                const.dir = c("=", "=", ">=", ">=", rep("<=", nv), rep("<=", nv), rep(">=", nv)),
                const.rhs = c(n, avg*n, -stdev^2 * (n-1), stdev^2 * (n-1), rep(10, nv), rep(1, nv), rep(-1, nv)),
                all.int = TRUE)
      rep(vals, head(mod$solution, -1))
    }
    samp <- get.sample(187, 67, 17, 30, 210)
    summary(samp)
    #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    #      30      64      69      67      74     119
    sd(samp)
    # [1] 17
    plot(table(samp))
    

    (Figure: plot(table(samp)) — frequencies of each integer value in the generated sample.)

    For the parameters you provided, we got the exact mean and standard deviation while returning all integer values, and the computation completed on my computer in about 0.4 seconds.

    Integer Optimization With a Template

    Another approach to getting something that resembles "real data" is to start from a continuous distribution (e.g. the result of the urnorm() call in the original post) and to round its values to integers in a way that best achieves the mean and standard deviation objectives. This only introduces two new classes of constraints: an upper bound on the frequency of each value (the number of continuous samples that could be rounded up or down to that value), and a lower bound on the sum of each pair of consecutive frequencies (the number of continuous samples that fall between those two integers). Again, this is easy to implement with lpSolve and efficient to run:
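
    A toy base-R illustration (five continuous samples on support 3:6, purely hypothetical numbers) of how those two bound vectors are computed:

```r
# Toy continuous sample and its integer support
init <- c(3.2, 3.7, 4.1, 4.9, 5.5)
vals <- 3:6
# Upper bound per value: only samples that floor or ceil to v
# can be rounded to v
lims <- as.vector(table(factor(c(floor(init), ceiling(init)), vals)))
# Lower bound per consecutive pair: each sample in [v, v+1) must
# round to v or v+1, so x_v + x_(v+1) >= floors[v]
floors <- as.vector(table(factor(floor(init), vals)))
lims    # 2 4 3 1
floors  # 2 2 1 0
```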

    library(lpSolve)
    get.sample2 <- function(n, avg, stdev, lb, ub, init.dist) {
      vals <- lb:ub
      nv <- length(vals)
      lims <- as.vector(table(factor(c(floor(init.dist), ceiling(init.dist)), vals)))
      floors <- as.vector(table(factor(c(floor(init.dist)), vals)))
      mod <- lp(direction = "min",
                objective.in = c(rep(0, nv), 1),
                const.mat = rbind(c(rep(1, nv), 0),
                                  c(vals, 0),
                                  c(-(vals-avg)^2, 1),
                                  c((vals-avg)^2, 1),
                                  cbind(diag(nv), rep(0, nv)),
                                  cbind(diag(nv) + cbind(rep(0, nv), diag(nv)[,-nv]), rep(0, nv))),
                const.dir = c("=", "=", ">=", ">=", rep("<=", nv), rep(">=", nv)),
                const.rhs = c(n, avg*n, -stdev^2 * (n-1), stdev^2 * (n-1), lims, floors),
                all.int = TRUE)
      rep(vals, head(mod$solution, -1))
    }
    
    library(Runuran)
    set.seed(5)
    init.dist <- urnorm(n=187, mean=67, sd=17, lb=30, ub=210)
    samp2 <- get.sample2(187, 67, 17, 30, 210, init.dist)
    summary(samp2)
    #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    #      32      57      66      67      77     107
    sd(samp2)
    # [1] 17
    plot(table(samp2))
    

    (Figure: plot(table(samp2)) — frequencies of each integer value in the rounded sample.)

    This approach is even faster (under 0.1 seconds) and still returns a distribution that exactly meets the required mean and standard deviation. Further, given sufficiently high quality samples from continuous distributions, this can be used to get distributions of different shapes that take integer values and meet the required statistical properties.