python, numpy, linear-algebra, correlation

I'm trying to generate synthetic data using Python. The data should be bivariate and have a specified correlation. Why doesn't my code work?


Here is what I've tried. I've been playing with this for a very long time and cannot figure out what I'm doing wrong. Can anyone help identify what I'm not seeing?

I'm trying to create 1,000 samples, each containing two variables, where one variable is correlated with the other at r=0.85 (or whatever correlation I specify). I don't really understand the Cholesky decomposition, so I'm assuming that the problem lies somewhere in that step.

import numpy as np
from scipy import stats

# Create random normal bivariate data with r=0.85
rng = np.random.default_rng(0)
correlation = 0.85
corr_matrix = np.array([[1, correlation], [correlation, 1]])
L = np.linalg.cholesky(corr_matrix)
n = 1000
random_data = rng.normal(size=(n, 2))
synthetic_data = np.dot(random_data, L)

# Check the correlation
r = stats.pearsonr(synthetic_data.T[0], synthetic_data.T[1])[0]

# r computes to 0.646.

Solution

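  • Your multiplication of L and random_data isn't quite right. np.linalg.cholesky returns the lower-triangular factor L, with L @ L.T equal to corr_matrix. Because each sample is a row of random_data, the factor has to be applied as random_data @ L.T. Multiplying by L instead produces data whose covariance is L.T @ L, which is a different matrix, and that is why r comes out near 0.65 rather than 0.85. Change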

    synthetic_data = np.dot(random_data, L)
    

    to

    synthetic_data = np.dot(random_data, L.T)
    

    See Generate correlated data in Python (3.3) for an alternative that uses the multivariate_normal method of the random generator. The link at the end of that answer goes to a SciPy cookbook page that is also worth checking out.
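
To check the fix yourself, here is a minimal sketch that reuses the variable names from the question. It uses np.corrcoef instead of stats.pearsonr only to avoid the SciPy import. The names fixed, direct, wrong_cov and the r_* variables are illustrative and not part of the original code. It compares the corrected product against the multivariate_normal route, and also computes the correlation implied by the original product directly from L.T @ L.

```python
import numpy as np

rng = np.random.default_rng(0)
correlation = 0.85
corr_matrix = np.array([[1.0, correlation], [correlation, 1.0]])
n = 1000

# np.linalg.cholesky returns the lower-triangular factor: L @ L.T == corr_matrix.
L = np.linalg.cholesky(corr_matrix)
assert np.allclose(L @ L.T, corr_matrix)

random_data = rng.normal(size=(n, 2))

# Samples are stored as rows, so the factor is applied as L.T on the right.
fixed = random_data @ L.T
r_fixed = np.corrcoef(fixed, rowvar=False)[0, 1]

# The original product random_data @ L has population covariance L.T @ L.
wrong_cov = L.T @ L
r_wrong_expected = wrong_cov[0, 1] / np.sqrt(wrong_cov[0, 0] * wrong_cov[1, 1])

# Alternative: let the generator do the factorization internally.
direct = rng.multivariate_normal(mean=[0.0, 0.0], cov=corr_matrix, size=n)
r_direct = np.corrcoef(direct, rowvar=False)[0, 1]

print(f"fixed: {r_fixed:.3f}")
print(f"multivariate_normal: {r_direct:.3f}")
print(f"implied by the original product: {r_wrong_expected:.3f}")
```

With n = 1000 the sample correlations will only land near 0.85, not exactly on it. The value implied by the original product equals correlation / sqrt(1 + correlation**2), so it depends only on the target correlation and not on the seed.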