Search code examples
rrandomcorrelationgenerated

Generate random values in R with a defined correlation in a defined range


For a science project, I am looking for a way to generate random data in a certain range (e.g. min=0, max=100000) with a certain correlation with another variable which already exists in R. The goal is to enrich the dataset a little so I can produce some more meaningful graphs (no worries, I am working with fictional data).

For example, I want to generate random values correlating with r=-.78 with the following data:

var1 <- rnorm(100, 50, 10)

I already came across some pretty good solutions (i.e. https://stats.stackexchange.com/questions/15011/generate-a-random-variable-with-a-defined-correlation-to-an-existing-variable), but only get very small values, which I cannot transform so the make sense in the context of the other, original values.

Following the example:

var1 <- rnorm(100, 50, 10)
n     <- length(var1)                   
rho   <- -0.78                   
theta <- acos(rho)             
x1    <- var1      
x2    <- rnorm(n, 50, 50)      
X     <- cbind(x1, x2)         
Xctr  <- scale(X, center=TRUE, scale=FALSE)   

Id   <- diag(n)                               
Q    <- qr.Q(qr(Xctr[ , 1, drop=FALSE]))       
P    <- tcrossprod(Q)          # = Q Q'       
x2o  <- (Id-P) %*% Xctr[ , 2]                 
Xc2  <- cbind(Xctr[ , 1], x2o)                
Y    <- Xc2 %*% diag(1/sqrt(colSums(Xc2^2)))  
var2 <- Y[ , 2] + (1 / tan(theta)) * Y[ , 1]    
cor(var1, var2)  

What I get for var2 are values ranging between -0.5 and 0.5. with a mean of 0. I would like to have much more distributed data, so I could simply transform it by adding 50 and have a quite simililar range compared to my first variable.

Does anyone of you know a way to generate this kind of - more or less -meaningful data?

Thanks a lot in advance!


Solution

  • Starting with var1, renamed to A, and using 10,000 points:

    set.seed(1)
    A <- rnorm(10000,50,10)  # Mean of 50
    

    First convert values in A to have the new desired mean 50,000 and have an inverse relationship (ie subtract):

    B <- 1e5 - (A*1e3) # Note that { mean(A) * 1000 = 50,000 }
    

    This only results in r = -1. Add some noise to achieve the desired r:

    B <- B + rnorm(10000,0,8.15e3) # Note this noise has mean = 0
                                   # the amount of noise, 8.15e3, was found through parameter-search
    

    This has your desired correlation:

    cor(A,B)
    [1] -0.7805972
    

    View with:

    plot(A,B)
    

    Caution
    Your B values might fall outside your range 0 100,000. You might need to filter for values outside your range if you use a different seed or generate more numbers.

    That said, the current range is fine:

    range(B)
    [1]  1668.733 95604.457