Tags: python, scikit-learn, linear-regression, pipeline

Pipeline with PolynomialFeatures and LinearRegression - unexpected result


With the following code I just want to fit a regression curve to sample data, but it does not work as expected.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = 10*np.random.rand(100)
y = 2*X**2 + 3*X - 5 + 3*np.random.rand(100)
xfit = np.linspace(0, 10, 100)


poly_model = make_pipeline(PolynomialFeatures(2), LinearRegression())
poly_model.fit(X[:,np.newaxis], y)


y_pred=poly_model.predict(X[:,np.newaxis])


plt.scatter(X,y)
plt.plot(X[:,np.newaxis],y_pred,color="red")

plt.show()


Shouldn't there be a curve that fits the data points nicely? After all, the training data (X[:,np.newaxis]) and the data used to compute y_pred are the same (also X[:,np.newaxis]).

If I instead use the xfit data for prediction, the result is as desired...

...

y_pred=poly_model.predict(xfit[:,np.newaxis])

plt.scatter(X,y)
plt.plot(xfit[:,np.newaxis],y_pred,color="red")

plt.show()


So what is the issue, and what explains this behaviour?
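The puzzle can be reduced to plt.plot alone, with no scikit-learn involved. A minimal sketch (the non-interactive Agg backend is an assumption so it runs headless):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for headless runs
import matplotlib.pyplot as plt

x = np.array([3.0, 1.0, 2.0])  # deliberately unsorted
y = x**2                       # exact values on a parabola, no noise

line, = plt.plot(x, y)
# plt.plot draws segments in array order: (3,9) -> (1,1) -> (2,4),
# so even points that lie perfectly on a parabola render as a zig-zag.
print(line.get_xdata())  # the x-values keep their original, unsorted order
```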


Solution

  • The difference between the two plots is that in the line

    plt.plot(X[:,np.newaxis],y_pred,color="red")
    

    the values in X[:,np.newaxis] are not sorted, while in

    plt.plot(xfit[:,np.newaxis],y_pred,color="red")
    

    the values of xfit[:,np.newaxis] are sorted.

    Now, plt.plot connects consecutive points in the array with straight line segments, and since the x-values are not sorted, you get the tangle of lines in your first figure.

    Replace

    plt.plot(X[:,np.newaxis],y_pred,color="red")
    

    with

    plt.scatter(X[:,np.newaxis],y_pred,color="red")
    

    and you'll get a clean figure: the original data with the red predicted points on top.
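If you prefer a connected curve rather than a second scatter, another option is to sort the x-values before predicting and plotting. A sketch of that approach (the seeded generator is an assumption, added for reproducibility):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for headless runs
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # assumption: seeded for reproducibility
X = 10 * rng.random(100)
y = 2 * X**2 + 3 * X - 5 + 3 * rng.random(100)

poly_model = make_pipeline(PolynomialFeatures(2), LinearRegression())
poly_model.fit(X[:, np.newaxis], y)

# Sort the x-values so consecutive points sit next to each other on the axis
X_sorted = np.sort(X)
y_line = poly_model.predict(X_sorted[:, np.newaxis])

plt.scatter(X, y)
plt.plot(X_sorted, y_line, color="red")  # now a single smooth curve
```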