I am having trouble generating this data set for my dissertation from the following distribution.
My attempt produces this data set, in which x and y look more independent than they should. I cannot spot where I am going wrong. Could somebody help me out?
Here is the code:
# Non-linear dependence without correlation
import numpy as np
import matplotlib.pyplot as plt
x = np.random.uniform(-0.5, 0.5, 500)
def y_samples(x):
    y = []
    for i in x:
        if np.abs(i) <= 1/6:
            y.append(np.random.normal(0, 1/9))
        else:
            y.append(0.5 * np.random.normal(1, 1/9) + 0.5 * np.random.normal(-1, 1/9))
    return y
y = y_samples(x)
plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.show()
Thanks!
You are handling the |x| > 1/6 case incorrectly, probably more because of a misunderstanding of the math than a misunderstanding of the code. The expression
0.5 * np.random.normal(1, 1/9) + 0.5 * np.random.normal(-1, 1/9)
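yields a normal distribution centered on zero, not a bimodal distribution with centers at -1 and 1. A weighted sum of two independent normal draws is itself a single normal draw; averaging the draws is not the same as choosing one of them at random.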
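A quick empirical check makes this concrete. The sketch below is not from the question; it assumes Python 3 and NumPy's `default_rng` generator, and the names `rng`, `n`, `sigma`, and `averaged` are only illustrative. It draws many values of the averaged expression and looks at their mean, spread, and range.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the check repeatable
n = 200_000
sigma = 1 / 9  # np.random.normal's second argument is a standard deviation

# The expression from the question, evaluated n times at once.
averaged = 0.5 * rng.normal(1, sigma, n) + 0.5 * rng.normal(-1, sigma, n)

# Theory: mean = 0.5*1 + 0.5*(-1) = 0,
# variance = 0.25*sigma**2 + 0.25*sigma**2, so std = sigma / sqrt(2).
print("sample mean:", averaged.mean())
print("sample std: ", averaged.std(), "expected:", sigma / np.sqrt(2))
print("largest |y|:", np.abs(averaged).max())
```

All the samples sit in a narrow band around zero, so nothing ever lands near -1 or 1.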
The fix is obvious once you understand the math: replace the offending calculation with something like
np.random.normal(1.0, 1.0/9.0) if np.random.random() > 0.5 else np.random.normal(-1.0, 1.0/9.0)
(I wrote 1.0/9.0 because 1/9 evaluates to 0 in Python 2, which I used for testing.)
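For reference, here is one way the whole sampler could look with that fix applied. This is a sketch under my own assumptions, not the asker's code: it targets Python 3 (so `1/6` and `1/9` are floats), uses NumPy's `default_rng`, and `sample_y` and its parameters are hypothetical names. It picks a centre of -1 or 1 with equal probability for each point with |x| > 1/6, uses centre 0 otherwise, and then draws all the normals in one vectorized call.

```python
import numpy as np

def sample_y(x, sigma=1/9, rng=None):
    """Draw y given x: N(0, sigma) if |x| <= 1/6, else N(+1 or -1, sigma)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    inner = np.abs(x) <= 1/6
    # Choose one mixture component per point (a fair coin flip between -1 and 1).
    signs = rng.choice([-1.0, 1.0], size=x.shape)
    # Centre is 0 inside the band and the chosen sign outside it.
    centers = np.where(inner, 0.0, signs)
    # loc may be an array; scale (a standard deviation) is broadcast.
    return rng.normal(centers, sigma)

rng = np.random.default_rng(42)
x = rng.uniform(-0.5, 0.5, 500)
y = sample_y(x, rng=rng)
```

Passing these `x` and `y` to the `plt.scatter(x, y)` call from the question should show three clusters: one around y = 0 for |x| <= 1/6, and two around y = 1 and y = -1 on either side.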