In a book I found the following code, which fits a LinearRegression to quadratic data:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
But how could that be? I know from the documentation that PolynomialFeatures(degree=2, include_bias=False) creates an array which looks like:
[[X[0], X[0]**2],
 [X[1], X[1]**2],
 ...
 [X[n], X[n]**2]]
BUT: How is the LinearRegression able to fit this data? In other words, WHAT is the LinearRegression actually doing, and what is the concept behind this?
I am grateful for any explanations!
PolynomialFeatures with degree two (and its default include_bias=True) creates an array that looks like:
[[1, X[0], X[0]**2],
 [1, X[1], X[1]**2],
 ...
 [1, X[n], X[n]**2]]
(With include_bias=False, as in your code, the leading column of ones is dropped and LinearRegression adds the intercept itself; either way the fitted model has the same three parameters.)
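If you want to see this concretely, here is a quick sanity check (a minimal sketch; the two sample points are made up for illustration):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2.0], [3.0]])  # two made-up sample points

# include_bias=False (as in your code): columns are [x, x**2]
print(PolynomialFeatures(degree=2, include_bias=False).fit_transform(x))
# [[2. 4.]
#  [3. 9.]]

# include_bias=True (the default): an extra leading column of ones
print(PolynomialFeatures(degree=2).fit_transform(x))
# [[1. 2. 4.]
#  [1. 3. 9.]]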
Let's call the matrix above X. Then the LinearRegression looks for 3 numbers a, b, c such that the vector

X * [[a], [b], [c]] - Y

has the smallest possible mean squared error (which is just the mean of the squares of the entries of that vector).

Note that the product X * [[a], [b], [c]] is just the product of the matrix X with the column vector [a, b, c].T. The result is a vector of the same dimension as Y.
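To make the least-squares picture concrete, here is a minimal sketch (assuming the m, X_poly, y and lin_reg from your code, where X_poly was built with include_bias=False): it prepends the column of ones and solves the same problem directly with NumPy.

import numpy as np

# prepend the column of ones, so the intercept becomes one of the three numbers a, b, c
X_design = np.c_[np.ones((m, 1)), X_poly]

# least-squares solution of X_design @ [[a], [b], [c]] ≈ y
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(coef.ravel())                        # a, b, c found by plain linear algebra
print(lin_reg.intercept_, lin_reg.coef_)   # the same numbers found by LinearRegression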
Regarding the questions in your comment:

This function is linear in the new set of features x, x**2. Just think of x**2 as an additional feature in your model.

For the particular array mentioned in your question, the LinearRegression looks for numbers a, b, c that minimize the sum

(a*1 + b*X[0] + c*X[0]**2 - Y[0])**2 + (a*1 + b*X[1] + c*X[1]**2 - Y[1])**2 + ... + (a*1 + b*X[n] + c*X[n]**2 - Y[n])**2

So it will find one such set of numbers a, b, c. Hence the suggested function y = a + b*x + c*x**2 is not based only on the first row. Instead, it is based on all the rows, because the parameters a, b, c that are chosen are the ones that minimize the sum above, and that sum involves elements from all the rows.
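As a small sketch of that objective (assuming the X, y and lin_reg from your code; the helper loss is just a name introduced here for illustration), you can check that the fitted parameters really do give the smallest sum:

import numpy as np

def loss(a, b, c):
    # sum of squared residuals of y = a + b*x + c*x**2 over all rows
    return np.sum((a + b * X + c * X**2 - y) ** 2)

a_fit = lin_reg.intercept_[0]
b_fit, c_fit = lin_reg.coef_[0]

print(loss(a_fit, b_fit, c_fit))  # the minimum over all choices of a, b, c
print(loss(2.0, 1.0, 0.5))        # the noise-free coefficients: at least as large on this sample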
Regarding x**2: the linear regression just regards it as an additional feature. You can give it a new name, v = x**2. Then the linear regression is of the form y = a + b*x + c*v, which means it is linear in x and v. The algorithm does not care how you created v; it just treats v as an additional feature.
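You can verify this with a minimal sketch (assuming the X and y from your code): build v = x**2 by hand instead of using PolynomialFeatures, and the fitted coefficients come out the same.

import numpy as np
from sklearn.linear_model import LinearRegression

v = X**2                  # the "new" feature, created by hand
X_manual = np.c_[X, v]    # same columns as PolynomialFeatures(degree=2, include_bias=False)

lin_reg_manual = LinearRegression().fit(X_manual, y)
print(lin_reg_manual.intercept_, lin_reg_manual.coef_)  # matches lin_reg from your code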