Tags: python, scikit-learn, linear-regression, pipeline

Pipeline with PolynomialFeatures and LinearRegression - unexpected result


With the following code I just want to fit a regression curve to sample data, but it does not work as expected.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = 10*np.random.rand(100)
y = 2*X**2 + 3*X - 5 + 3*np.random.rand(100)
xfit = np.linspace(0, 10, 100)


poly_model = make_pipeline(PolynomialFeatures(2), LinearRegression())
poly_model.fit(X[:,np.newaxis], y)


y_pred=poly_model.predict(X[:,np.newaxis])


plt.scatter(X,y)
plt.plot(X[:,np.newaxis],y_pred,color="red")

plt.show()


Shouldn't there be a curve that fits the data points nicely? After all, the training data (X[:,np.newaxis]) and the data used to compute y_pred are the same (also X[:,np.newaxis]).

If I instead use the xfit data for prediction, the result is as desired...

...

y_pred=poly_model.predict(xfit[:,np.newaxis])

plt.scatter(X,y)
plt.plot(xfit[:,np.newaxis],y_pred,color="red")

plt.show()


So what is the issue, and what explains this behaviour?
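The puzzle can be reduced to plt.plot alone, with no scikit-learn involved. A minimal sketch (the non-interactive Agg backend is an assumption so it runs headless):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for headless runs
import matplotlib.pyplot as plt

x = np.array([3.0, 1.0, 2.0])  # deliberately unsorted
y = x**2                       # exact values on a parabola, no noise

line, = plt.plot(x, y)
# plt.plot draws segments in array order: (3,9) -> (1,1) -> (2,4),
# so even points that lie perfectly on a parabola render as a zig-zag.
print(line.get_xdata())  # the x-values keep their original, unsorted order
```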


Solution

  • The difference between the two plots is that in the line

    plt.plot(X[:,np.newaxis],y_pred,color="red")
    

    the values in X[:,np.newaxis] are not sorted, while in

    plt.plot(xfit[:,np.newaxis],y_pred,color="red")
    

    the values of xfit[:,np.newaxis] are sorted.

    Now, plt.plot connects consecutive points in the array with straight line segments, and since the x-values are not sorted, you get the tangle of lines in your first figure.

    Replace

    plt.plot(X[:,np.newaxis],y_pred,color="red")
    

    with

    plt.scatter(X[:,np.newaxis],y_pred,color="red")
    

    and you'll get a clean figure: the original data with the red predicted points on top.
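If you prefer a connected curve rather than a second scatter, another option is to sort the x-values before predicting and plotting. A sketch of that approach (the seeded generator is an assumption, added for reproducibility):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for headless runs
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # assumption: seeded for reproducibility
X = 10 * rng.random(100)
y = 2 * X**2 + 3 * X - 5 + 3 * rng.random(100)

poly_model = make_pipeline(PolynomialFeatures(2), LinearRegression())
poly_model.fit(X[:, np.newaxis], y)

# Sort the x-values so consecutive points sit next to each other on the axis
X_sorted = np.sort(X)
y_line = poly_model.predict(X_sorted[:, np.newaxis])

plt.scatter(X, y)
plt.plot(X_sorted, y_line, color="red")  # now a single smooth curve
```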