
Generate data from truncated normal distribution with exact mean and sd in R


I am struggling with the following task: I need to generate data from a truncated normal distribution, and the sample mean and standard deviation should exactly match the values specified for the population. This is what I have so far:

    mean <- 100
    sd <- 5
    lower <- 40
    upper <- 120
    n <- 100   

    library(msm)    
    data <- as.numeric(mean+sd*scale(rtnorm(n, lower=40, upper=120)))

The sample that's created takes on exactly the mean and sd specified in the population. But some values exceed the intended bounds. Any idea how to fix this? I was thinking of just cutting off all values outside these bounds, but then mean and sd don't resemble those of the population anymore.


Solution

  • You could use an iterative approach. The code below adds draws to the vector one at a time, keeping a draw only if the rescaled data set still lies within the bounds that you set. It takes longer, but it works:

    library(msm)  # provides rtnorm()
    
    n <- 10000
    mean <- 100
    sd <- 15
    lower <- 40
    upper <- 120
    
    # Bounds expressed on the standard-normal scale
    z_lower <- (lower - mean) / sd
    z_upper <- (upper - mean) / sd
    
    data <- rtnorm(1, lower = z_lower, upper = z_upper)
    while (length(data) < n) {
      draw <- rtnorm(1, lower = z_lower, upper = z_upper)
      candidate <- c(data, draw)
      candidate_scaled <- mean + sd * scale(candidate)
      # Keep the draw only if the rescaled sample stays within the bounds
      if (min(candidate_scaled) >= lower && max(candidate_scaled) <= upper) {
        data <- candidate
      }
    }
    
    scaled_data <- as.numeric(mean + sd * scale(data))
    
    summary(scaled_data)
    
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      40.38   91.61  104.35  100.00  111.28  120.00
    
    sd(scaled_data)
    
    15
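
If you want to confirm all three requirements at once, a small check along these lines might help. This is only a sketch: `check_sample` and its `tol` argument are hypothetical names that are not part of msm or of the code above, and the tolerance is an assumption to absorb floating-point error.

```r
# Hypothetical helper (not from the original answer): reports whether a sample
# hits the target mean and sd within a tolerance and stays inside the bounds.
check_sample <- function(x, target_mean, target_sd, lower, upper, tol = 1e-8) {
  c(
    mean_ok   = abs(mean(x) - target_mean) < tol,
    sd_ok     = abs(sd(x) - target_sd) < tol,
    bounds_ok = min(x) >= lower && max(x) <= upper
  )
}

# Toy vector with mean 100 and sd 10, used only to illustrate the call.
toy <- c(90, 100, 110)
print(check_sample(toy, target_mean = 100, target_sd = 10, lower = 40, upper = 120))
print(check_sample(toy, target_mean = 100, target_sd = 10, lower = 40, upper = 105))
```

For the data above you would call it as `check_sample(scaled_data, 100, 15, 40, 120)`; any `FALSE` entry tells you which requirement is violated.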
    

    Below is my old answer, which doesn't quite work:

    How about standardising the lower and upper limits passed to rtnorm using the mean and sd that you want?

    n <- 1000000
    mean <- 100
    sd <- 5
    
    library(msm)
    
    data <- as.numeric(mean+sd*scale(rtnorm(n, lower=((40 - mean)/sd), upper=((120 - mean)/sd))))
    
    summary(data)
    
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      76.91   96.63  100.00  100.00  103.37  120.00 
    
    sd(data)
    
    5
    

    In this case, even with a sample of 1,000,000, you get the exact mean and sd, and the minimum and maximum shown in the summary remain within your bounds.
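
One way to see when plain rescaling is likely to push values past the bounds is to look at the theoretical mean and sd of the truncated standard normal. The sketch below uses only base R and the standard closed-form moments; `truncated_moments` and `rescaled_upper_limit` are hypothetical helper names, and the reasoning assumes a large sample whose mean, sd and maximum are close to their theoretical values.

```r
# Mean and sd of a standard normal truncated to [a, b]
# (standard closed-form expressions; base R only).
truncated_moments <- function(a, b) {
  z <- pnorm(b) - pnorm(a)
  m <- (dnorm(a) - dnorm(b)) / z
  v <- 1 + (a * dnorm(a) - b * dnorm(b)) / z - m^2
  c(mean = m, sd = sqrt(v))
}

# Hypothetical helper: the value that the largest observation of a very large
# sample approaches after the sample is rescaled to the target mean and sd.
rescaled_upper_limit <- function(target_mean, target_sd, lower, upper) {
  a <- (lower - target_mean) / target_sd
  b <- (upper - target_mean) / target_sd
  tm <- truncated_moments(a, b)
  target_mean + target_sd * (b - tm[["mean"]]) / tm[["sd"]]
}

print(rescaled_upper_limit(100, 5, 40, 120))   # truncation barely bites
print(rescaled_upper_limit(100, 15, 40, 120))  # truncation bites noticeably
```

With sd = 5 the limit sits very close to 120, because the bounds are far out in the tails. With sd = 15 it comes out well above 120, so a large rescaled sample would be expected to break the upper bound unless draws are filtered, as in the iterative version.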