Tags: python, numpy, optimization, normal-distribution, pyomo

Why are samples generated by the np.random.multivariate_normal method not compatible with the covariance matrix?


I am working on a data-driven robust optimization approach. For the numerical results, I need to validate the method using sample data. I use np.random.multivariate_normal to generate the data, for which I supply the covariance matrix and mean vector as follows:

mean = [-1000, 1, 1, 1]
cov = [[200, 0, 0, 0], [0, 0.001, 0, 0], [0, 0, 0.001, 0], [0, 0, 0, 0.001]]

After generating the data, the first components of the samples are not spread over the interval [-1200, -800]; instead, they fall in a much smaller interval (e.g., [-1003, -997]). I need to generate sample data whose first components are distributed more widely over the interval [-1200, -800]. The code that I use is as follows:

import numpy as np
import matplotlib.pyplot as plt
# 4D Gaussian: first component has mean -1000 and variance 200,
# the other three components have mean 1 and variance 0.001
n = 50  # number of samples
mean = [-1000, 1, 1, 1]
cov = [[200, 0, 0, 0], [0, 0.001, 0, 0], [0, 0, 0.001, 0], [0, 0, 0, 0.001]]
Samples = np.random.multivariate_normal(mean, cov, n).T

I do not know where I am going wrong.


Solution

  • It seems that you are asking for a mean of -1000 and a variance of 1 (or rather, the original question did so; it appears to have been updated to 200, which, however, still does not match the observed values of roughly -1003 to -997 mentioned later).

    Note that we expect about 99.7% of the values in an infinite population to fall within 3 sigma of the mean (sigma being the standard deviation, $\sqrt{\mathrm{Var}}$), which for a variance of 1 is essentially 3. So your results are expected.

    If you want a larger variance, you will need to specify it in cov.
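
    As a quick numerical check of the 3-sigma rule (this sketch is not part of the original answer; the seed and sample size are arbitrary), note that with a variance of 200 the first component stays within roughly $\pm 3\sqrt{200} \approx \pm 42$ of the mean, nowhere near spanning [-1200, -800]:

    import numpy as np

    rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
    mean = [-1000, 1, 1, 1]
    cov = [[200, 0, 0, 0], [0, 0.001, 0, 0], [0, 0, 0.001, 0], [0, 0, 0, 0.001]]
    samples = rng.multivariate_normal(mean, cov, 10000)

    sigma = np.sqrt(cov[0][0])                       # ~14.1
    inside = np.abs(samples[:, 0] - mean[0]) <= 3 * sigma
    print(inside.mean())                             # ~0.997, the 3-sigma rule
    print(samples[:, 0].min(), samples[:, 0].max())  # roughly -1055 to -945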

    Guess and check to get the right variance

    Here is some experimentation I did to show this, gradually increasing the variance:

    >>> import numpy as np
    # increase n to get more asymptotic
    >>> n = 1000
    # what you had
    >>> mean = [-1000, 1,1,1]
    >>> cov = [[1,0,0,0],[0,0.001,0,0],[0,0,0.001,0],[0,0,0,0.001]]
    >>> Samples = np.random.multivariate_normal(mean, cov, n).T
    >>> min(Samples[0,:])
    -1003.1521026984535
    # larger variance ([0, 0] element)
    >>> cov = [[200,0,0,0],[0,0.001,0,0],[0,0,0.001,0],[0,0,0,0.001]]
    >>> Samples = np.random.multivariate_normal(mean, cov, n).T
    >>> min(Samples[0,:])
    -1058.8437937762053
    # yet larger variance
    >>> cov = [[2000,0,0,0],[0,0.001,0,0],[0,0,0.001,0],[0,0,0,0.001]]
    >>> Samples = np.random.multivariate_normal(mean, cov, n).T
    >>> min(Samples[0,:])
    -1145.3564799931166
    # yet larger variance
    >>> cov = [[4000,0,0,0],[0,0.001,0,0],[0,0,0.001,0],[0,0,0,0.001]]
    >>> Samples = np.random.multivariate_normal(mean, cov, n).T
    >>> min(Samples[0,:])
    -1247.6489017302786 
    

    Another approach to find the desired variance

    You can also figure this out analytically. Say you want -1200 to -800 to be your range and your mean is -1000. Then you want 3 sigma to be 200, so sigma should be roughly 66.7.

    Variance is sigma squared, so $66.7^2 = 4448.89$. This turns out to be quite close to the answer I arrived at by guess and check, a variance of 4000.

    Thus, based on the desired range of your data, you can choose variances in a disciplined fashion.
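
    As a minimal sketch of that derivation (again, not part of the original answer; the variable names are my own), the variance can be computed directly from the desired range and dropped into cov:

    import numpy as np

    low, high = -1200, -800        # desired range for the first component
    mu = (low + high) / 2          # -1000
    sigma = (high - low) / 2 / 3   # half-width / 3, ~66.7
    var0 = sigma ** 2              # ~4444, close to the guessed 4000

    mean = [mu, 1, 1, 1]
    cov = [[var0, 0, 0, 0],
           [0, 0.001, 0, 0],
           [0, 0, 0.001, 0],
           [0, 0, 0, 0.001]]
    Samples = np.random.multivariate_normal(mean, cov, 1000).T
    print(Samples[0].min(), Samples[0].max())  # roughly spans [-1200, -800]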