Tags: python, numpy, machine-learning, scikit-learn, polynomials

How to use .reshape() with PolynomialFeatures from sklearn.preprocessing to get the correct number of inputs for multivariate polynomial regression?


I am learning about sklearn, especially polynomial model fitting. When using PolynomialFeatures to generate 2nd-degree polynomial features, there is something I am not understanding about how LinearRegression() expects to receive data based on the dataframe dimensions. Here is the error message I keep getting, followed by the code to replicate it:

ValueError: X has 4 features, but LinearRegression is expecting 14 features as input.

Here is the code to replicate:

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Create dataframes
Dum_data = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12]])
Dum_data_y = pd.DataFrame([[13],[14],[15]])

# Fit a 2nd-degree polynomial feature transform
poly_fit = PolynomialFeatures(degree = 2, include_bias = False)
Dum_poly = poly_fit.fit_transform(Dum_data)

print(Dum_data.shape, Dum_data_y.shape)
# Fit the linear model to the polynomial features
modl = LinearRegression()
modl.fit(Dum_poly, Dum_data_y)

# Now get the predictions (this is the line that raises the ValueError)
Dum_y_pred = modl.predict(Dum_data)

I see a similar issue here about converting to a numpy array and reshaping, but the guides I am trying to follow, polynomial regression...using scikit-learn and Multivariate regression with Python, seem to pass in dataframes directly. I know I need to use the .reshape() function in some capacity, but after toying around with different data dimensions, I cannot tell how to determine what number of features is expected. Thanks!


Solution

  • You can modify the final line of the code as follows:

    Dum_y_pred = modl.predict(Dum_poly)
    

    The original data contains 4 features: x1, x2, x3, and x4.

    When you apply PolynomialFeatures with degree=2, it adds 10 more features, one for each pairwise product: x1x1, x1x2, x1x3, x1x4, x2x2, x2x3, x2x4, x3x3, x3x4, and x4x4.

    In total, this results in 14 features used to train your model, so the fitted model can only accept input with those same 14 features. Passing the raw 4-column Dum_data to modl.predict() is what triggers the ValueError; you must pass the transformed Dum_poly instead (or apply poly_fit.transform() to any new data before predicting).
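
Below is a minimal end-to-end sketch using the same dummy data as the question. It transforms the input before both fit and predict, and prints the generated feature names so the 14-column expansion is visible. Note that get_feature_names_out() assumes a reasonably recent scikit-learn (1.0 or later); older releases expose get_feature_names() instead.

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Same dummy data as in the question
Dum_data = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
Dum_data_y = pd.DataFrame([[13], [14], [15]])

# Expand to degree-2 polynomial features (no bias column)
poly_fit = PolynomialFeatures(degree=2, include_bias=False)
Dum_poly = poly_fit.fit_transform(Dum_data)

# 4 original features + 10 degree-2 terms = 14 columns
print(Dum_poly.shape)                    # (3, 14)
print(poly_fit.get_feature_names_out())  # x0, x1, x2, x3, x0^2, x0 x1, ...

# Train on the transformed features...
modl = LinearRegression()
modl.fit(Dum_poly, Dum_data_y)

# ...and predict on the transformed features as well
Dum_y_pred = modl.predict(Dum_poly)
print(Dum_y_pred)

The key point is that anything passed to modl.predict() must first go through poly_fit.transform(), so the model always sees the same 14 columns it was trained on.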