python · machine-learning · scikit-learn · polynomials

Python PolynomialFeatures transforms data into different shape from the original one


I'm using sklearn's PolynomialFeatures to preprocess data into various degree transformations in order to compare their model fit. Below is my code:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.model_selection import train_test_split

    np.random.seed(0)
    # x and y are the original data
    n = 100
    x = np.linspace(0, 10, n) + np.random.randn(n)/5
    y = np.sin(x) + n/6 + np.random.randn(n)/10
    # use PolynomialFeatures and fit_transform to transform the original data to degree 2
    poly1 = PolynomialFeatures(degree=2)
    x_D2_poly = poly1.fit_transform(x)
    # check out their dimensions
    print(x.shape)
    print(x_D2_poly.shape)

However, the above transformation returned an array of shape (1, 5151) from the original x of shape (100, 1). This is not what I expected, and I can't figure out what's wrong with my code. It would be great if someone could point out the error in my code or a misconception on my part. Should I use an alternative method to transform the original data instead?

Thank you.


[Update] After I used x = x.reshape(-1, 1) to reshape the original x to (100, 1), x_poly1 = poly1.fit_transform(x) does give me output with the expected dimensions. However, when I did a train_test_split, fitted the data, and tried to obtain predicted values:

    x_poly1_train, x_poly1_test, y_train, y_test = train_test_split(x_poly1, y, random_state = 0)
    linreg = LinearRegression().fit(x_poly1_train, y_train)
    poly_predict = LinearRegression().predict(x)

Python returned an error message:

    shapes (1,100) and (2,) not aligned: 100 (dim 1) != 2 (dim 0)

Apparently I got the dimensions wrong somewhere again. Could anyone shed some light on this?

Thank you.


Solution

  • I think you need to reshape your x like this:

        x = x.reshape(-1, 1)

    Your x had shape (100,), not (100, 1), and fit_transform expects a 2-D array of shape (n_samples, n_features). Because your 1-D x was treated as a single sample with 100 features, the degree-2 expansion produced 5151 columns: one feature for each distinct pair of features (100*99/2 = 4950), one for each feature squared (100), one for the first power of each feature (100), and one for the 0th power, i.e. the bias term (1). 4950 + 100 + 100 + 1 = 5151.
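
    A quick sketch (assuming NumPy and scikit-learn are available as imported below) that reproduces both shapes:

        import numpy as np
        from sklearn.preprocessing import PolynomialFeatures

        n = 100
        x = np.linspace(0, 10, n)

        # A 1-D array was (in older scikit-learn versions) silently treated as a
        # single sample with 100 features; that is where the 5151 columns come from.
        row = x.reshape(1, -1)   # shape (1, 100)
        print(PolynomialFeatures(degree=2).fit_transform(row).shape)   # (1, 5151)

        # A column vector is 100 samples with 1 feature each, so degree 2
        # produces the expected three columns: 1, x, x**2.
        col = x.reshape(-1, 1)   # shape (100, 1)
        print(PolynomialFeatures(degree=2).fit_transform(col).shape)   # (100, 3)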

    Response to your edited question: you need to call transform on the fitted PolynomialFeatures object to convert the data you wish to predict on, and call predict on the fitted linreg model rather than on a new, unfitted LinearRegression.
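
    Here is a minimal sketch of that workflow, under the assumption that you want predictions both for the held-out test split and for fresh x values (variable names follow your snippet; new_x is a hypothetical example input):

        import numpy as np
        from sklearn.linear_model import LinearRegression
        from sklearn.model_selection import train_test_split
        from sklearn.preprocessing import PolynomialFeatures

        np.random.seed(0)
        n = 100
        x = np.linspace(0, 10, n) + np.random.randn(n)/5
        y = np.sin(x) + n/6 + np.random.randn(n)/10

        x = x.reshape(-1, 1)                  # (100, 1): 100 samples, 1 feature
        poly1 = PolynomialFeatures(degree=2)
        x_poly1 = poly1.fit_transform(x)      # (100, 3): bias, x, x**2

        x_poly1_train, x_poly1_test, y_train, y_test = train_test_split(
            x_poly1, y, random_state=0)
        linreg = LinearRegression().fit(x_poly1_train, y_train)

        # Predict with the fitted model on data that has gone through the same
        # transform; predicting on raw x (or with an unfitted estimator) fails.
        test_predict = linreg.predict(x_poly1_test)

        new_x = np.linspace(0, 10, 50).reshape(-1, 1)
        new_predict = linreg.predict(poly1.transform(new_x))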