Tags: python, statistics, probability-distribution

How to generate non-linear dependence between variables without correlation?


I am having trouble generating this data set for my dissertation from the following distribution.

My attempt produces this data set, in which the variables look closer to independent. I cannot spot where I am going wrong. Could somebody help me out?

Here is the code:

# Non-linear dependence without correlation
import numpy as np
import matplotlib.pyplot as plt

x = np.random.uniform(-0.5, 0.5, 500)

def y_samples(x):
    y = []
    for i in x:
        if np.abs(i) <= 1/6:
            y.append(np.random.normal(0, 1/9))
        else:
            y.append(0.5 * np.random.normal(1, 1/9) + 0.5 * np.random.normal(-1, 1/9))
    return y    

y = y_samples(x)

plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.show()

Thanks!


Solution

  • You are handling the |x| > 1/6 case incorrectly, and the problem lies in the math rather than in the code. The expression

    0.5 * np.random.normal(1, 1/9) + 0.5 * np.random.normal(-1, 1/9)
    

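    yields a normal distribution centered on zero, not a bimodal distribution with centers at -1 and 1. Averaging two independent normal draws gives another normal draw, here with mean 0 and standard deviation (1/9)/sqrt(2). A mixture instead picks one of the two components at random for each sample.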

    Once the math is clear, the fix is straightforward: replace the offending calculation with something like

    np.random.normal(1.0, 1.0/9.0) if np.random.random() > 0.5 else np.random.normal(-1.0, 1.0/9.0)
    

    (I wrote 1.0/9.0 because 1/9 evaluates to 0 under Python 2's integer division, which I used for testing. In Python 3, 1/9 already gives a float.)
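For reference, here is a minimal vectorized sketch of the same fix. It is one possible way to write it, not the only one. It assumes, as in the question's code, that 1/9 is meant as the standard deviation (the second argument of `normal` is a standard deviation, not a variance). The seed and the use of `np.random.default_rng` are my own choices for reproducibility and are not part of the original code.

```python
import numpy as np

# Seed chosen arbitrarily so the run is reproducible (an assumption, not from the question).
rng = np.random.default_rng(0)

def y_samples(x, rng):
    """Draw y given x: N(0, 1/9) when |x| <= 1/6, otherwise an
    equal-weight mixture of N(1, 1/9) and N(-1, 1/9)."""
    x = np.asarray(x)
    # For each sample, pick one mixture component: +1 or -1 with equal probability.
    signs = rng.choice([-1.0, 1.0], size=x.shape)
    # Use mean 0 in the middle band and the randomly chosen +/-1 elsewhere.
    means = np.where(np.abs(x) <= 1.0 / 6.0, 0.0, signs)
    # One normal draw per sample around its selected mean (scale = standard deviation).
    return rng.normal(means, 1.0 / 9.0)

x = rng.uniform(-0.5, 0.5, 500)
y = y_samples(x, rng)

# Quick sanity checks: outer samples should cluster near -1 and 1,
# and the sample correlation between x and y should be close to zero.
outer = np.abs(x) > 1.0 / 6.0
print("smallest |y| in the outer band:", np.abs(y[outer]).min())
print("sample correlation:", np.corrcoef(x, y)[0, 1])
```

Passing `x` and `y` to `plt.scatter` as in the question should then show three clusters: one around y = 0 for |x| <= 1/6, and two around y = 1 and y = -1 elsewhere.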