I'm using sklearn's PolynomialFeatures to expand my data into polynomial features of various degrees so that I can compare how well each degree fits. Below is my code:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
np.random.seed(0)
# x and y are the original data
n = 100
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+n/6 + np.random.randn(n)/10
# using .PolynomialFeatures and fit_transform to transform original data to degree 2
poly1 = PolynomialFeatures(degree=2)
x_D2_poly = poly1.fit_transform(x)
#check out their dimensions
x.shape
x_D2_poly.shape
However, the transformation above returned an array of shape (1, 5151) from the original x of shape (100, 1). This is not what I expected, and I can't figure out what is wrong with my code. It would be great if someone could point out the error in my code or the misconception on my part. Should I use a different method to transform the original data instead?
Thank you.
[update] After reshaping the original x with x = x.reshape(-1, 1), poly1.fit_transform(x) does give me the desired output dimension (100, 1). However, when I ran train_test_split, fitted the model, and tried to obtain predicted values:
x_poly1_train, x_poly1_test, y_train, y_test = train_test_split(x_poly1, y, random_state = 0)
linreg = LinearRegression().fit(x_poly1_train, y_train)
poly_predict = LinearRegression().predict(x)
Python returned an error message:
shapes (1,100) and (2,) not aligned: 100 (dim 1) != 2 (dim 0)
Apparently I got the dimensions wrong somewhere again. Could anyone shed some light on this?
Thank you.
I think you need to reshape your x like
x=x.reshape(-1,1)
Your x had shape (100,), not (100, 1), and fit_transform expects a 2-dimensional array of shape (n_samples, n_features). Your data was therefore treated as a single sample with 100 features. With degree=2 that produces 5151 output features: one for each distinct pair of features (100*99/2 = 4950), one for the square of each feature (100), one for the first power of each feature (100), and one for the 0th power (1), so 4950 + 100 + 100 + 1 = 5151.
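The shape arithmetic above can be checked directly. This is a minimal sketch that reuses the question's data generation; the names x_col, x_row, x_d2 and x_row_d2 are illustrative and do not appear in the original code.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(0)
n = 100
x = np.linspace(0, 10, n) + np.random.randn(n) / 5  # 1-D array, shape (100,)

# Intended layout: 100 samples (rows), 1 feature (column).
x_col = x.reshape(-1, 1)
x_d2 = PolynomialFeatures(degree=2).fit_transform(x_col)
print(x_col.shape)  # (100, 1)
print(x_d2.shape)   # (100, 3): columns are 1, x, x**2

# Accidental layout: 1 sample with 100 features.
# This reproduces the 5151 columns from the question.
x_row = x.reshape(1, -1)
x_row_d2 = PolynomialFeatures(degree=2).fit_transform(x_row)
print(x_row_d2.shape)  # (1, 5151)
```

Note that with a single input feature, degree=2 gives 3 columns (bias, x, x squared), not 1.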
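Response to your edited question: You need to call transform on the data you wish to predict on, so that it has the same polynomial features as the data the model was trained on. Also call predict on the fitted model (linreg) rather than on a new, unfitted LinearRegression().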
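For reference, here is a sketch of the whole workflow under the same data generation as the question. The names X, poly, X_poly, x_new and y_new are illustrative assumptions, and x_new is a made-up grid of new inputs rather than anything from the original post.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(0)
n = 100
x = np.linspace(0, 10, n) + np.random.randn(n) / 5
y = np.sin(x) + n / 6 + np.random.randn(n) / 10

# Reshape to (n_samples, n_features) before any sklearn call.
X = x.reshape(-1, 1)

# Fit the transformer once and keep it so it can be reused on new data.
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_poly, y, random_state=0)
linreg = LinearRegression().fit(X_train, y_train)

# Hypothetical new inputs: reshape, transform with the SAME fitted
# transformer, then predict with the FITTED model.
x_new = np.linspace(0, 10, 50).reshape(-1, 1)
y_new = linreg.predict(poly.transform(x_new))
print(y_new.shape)  # (50,)

# R^2 on the held-out split, which is already in polynomial form.
print(linreg.score(X_test, y_test))
```

The point is that whatever goes into predict must have the same number of columns as the training matrix, which is why the raw x fails with a shape mismatch.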