With the following code I want to fit a regression curve to some sample data, but the plot does not look as expected.
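import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
X = 10*np.random.rand(100)
y = 2*X**2+3*X-5+3*np.random.rand(100)
xfit = np.linspace(0,10,100)
poly_model = make_pipeline(PolynomialFeatures(2),LinearRegression())
poly_model.fit(X[:,np.newaxis],y)
y_pred = poly_model.predict(X[:,np.newaxis])
plt.scatter(X,y)
plt.plot(X[:,np.newaxis],y_pred,color="red")
plt.show()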
Shouldn't there be a curve that fits the data points perfectly? After all, the training data (X[:,np.newaxis]) and the data used to compute y_pred are the same (also X[:,np.newaxis]).
If I instead use the xfit data for the prediction, the result is as desired:
...
y_pred=poly_model.predict(xfit[:,np.newaxis])
plt.scatter(X,y)
plt.plot(xfit[:,np.newaxis],y_pred,color="red")
plt.show()
So what is the issue, and what explains this behaviour?
The difference between the two plots is that in the line
plt.plot(X[:,np.newaxis],y_pred,color="red")
the values in X[:,np.newaxis] are not sorted, while in
plt.plot(xfit[:,np.newaxis],y_pred,color="red")
the values of xfit[:,np.newaxis] are sorted.
Now, plt.plot connects each pair of consecutive points in the array with a straight line. Since the x values are not sorted, the line jumps back and forth across the plot, which gives you the tangle of lines in your first figure.
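To make the ordering point concrete, here is a minimal sketch that checks it numerically instead of visually. It reuses the setup from the question; the fixed seed, the `rng` generator, and the names `backward_steps_X`, `backward_steps_xfit`, `coeffs` and `max_dev` are my own additions for illustration. It counts how often the x coordinate moves backwards between consecutive points, and checks that the predictions on the unsorted X still all lie on a single quadratic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Fixed seed is an assumption, only here so the run is reproducible.
rng = np.random.default_rng(0)
X = 10 * rng.random(100)
y = 2 * X**2 + 3 * X - 5 + 3 * rng.random(100)
xfit = np.linspace(0, 10, 100)

poly_model = make_pipeline(PolynomialFeatures(2), LinearRegression())
poly_model.fit(X[:, np.newaxis], y)
y_pred = poly_model.predict(X[:, np.newaxis])

# Count the steps where x decreases from one point to the next.
# Each such step makes plt.plot draw a segment going back to the left.
backward_steps_X = int(np.sum(np.diff(X) < 0))
backward_steps_xfit = int(np.sum(np.diff(xfit) < 0))
print("backward steps in X:   ", backward_steps_X)
print("backward steps in xfit:", backward_steps_xfit)

# The predictions themselves are fine: a degree-2 polynomial passed
# through (X, y_pred) reproduces them up to floating-point error.
coeffs = np.polyfit(X, y_pred, 2)
max_dev = np.max(np.abs(np.polyval(coeffs, X) - y_pred))
print("max deviation from a single quadratic:", max_dev)
```

So the model output is the same smooth curve in both cases; only the order in which the points are handed to plt.plot differs.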
Replace
plt.plot(X[:,np.newaxis],y_pred,color="red")
with
plt.scatter(X[:,np.newaxis],y_pred,color="red")
and you'll get a clean figure, with the red predicted points tracing out the fitted parabola.
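If you would rather keep a connected line than switch to a scatter plot, one alternative (not part of the fix above, just a sketch) is to sort the x values and reorder the predictions the same way before calling plt.plot. The fixed seed and the names `rng`, `order`, `X_sorted` and `y_pred_sorted` are assumptions introduced for this example.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Fixed seed is an assumption, only here so the run is reproducible.
rng = np.random.default_rng(0)
X = 10 * rng.random(100)
y = 2 * X**2 + 3 * X - 5 + 3 * rng.random(100)

poly_model = make_pipeline(PolynomialFeatures(2), LinearRegression())
poly_model.fit(X[:, np.newaxis], y)
y_pred = poly_model.predict(X[:, np.newaxis])

# np.argsort returns the indices that would sort X in ascending order.
# Applying the same indices to y_pred keeps each prediction paired
# with its own x value.
order = np.argsort(X)
X_sorted = X[order]
y_pred_sorted = y_pred[order]

plt.scatter(X, y)
plt.plot(X_sorted, y_pred_sorted, color="red")  # points now run left to right
plt.show()
```

Predicting on xfit, as in the question, achieves the same thing, because np.linspace already returns values in increasing order.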