What I wanted to do is go over some OLS fits with different degrees of polynomials to see which degree performs better at predicting mpg
given horsepower
(while using LOOCV and KFold). I wrote the code, but I couldn't figure out how to apply the PolynomialFeatures
function to every iteration using GridSearchCv
, so I ended up writing this:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import LeaveOneOut, KFold
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
df = pd.read_csv('http://web.stanford.edu/~oleg2/hse/auto/Auto.csv')[['horsepower','mpg']].dropna()
pows = range(1,11)
first, second, mse = [], [], 0 # 'first' is data for the first plot and 'second' is for the second one
for p in pows:
mse = 0
for train_index, test_index in LeaveOneOut().split(df):
x_train, x_test = df.horsepower.iloc[train_index], df.horsepower.iloc[test_index]
y_train, y_test = df.mpg.iloc[train_index], df.mpg.iloc[test_index]
polynomial_features = PolynomialFeatures(degree = p)
x = polynomial_features.fit_transform(x_train.values.reshape(-1,1)) #getting the polynomial
ft = LinearRegression().fit(x,y_train)
x1 = polynomial_features.fit_transform(x_test.values.reshape(-1,1)) #getting the polynomial
mse += mean_squared_error(y_test, ft.predict(x1))
first.append(mse/len(df))
for p in pows:
temp = []
for i in range(9): # this is to plot a few graphs for comparison
mse = 0
for train_index, test_index in KFold(10, True).split(df):
x_train, x_test = df.horsepower.iloc[train_index], df.horsepower.iloc[test_index]
y_train, y_test = df.mpg.iloc[train_index], df.mpg.iloc[test_index]
polynomial_features = PolynomialFeatures(degree = p)
x = polynomial_features.fit_transform(x_train.values.reshape(-1,1)) #getting the polynomial
ft = LinearRegression().fit(x,y_train)
x1 = polynomial_features.fit_transform(x_test.values.reshape(-1,1)) #getting the polynomial
mse += mean_squared_error(y_test, ft.predict(x1))
temp.append(mse/10)
second.append(temp)
f, pt = plt.subplots(1,2,figsize=(12,5.1))
f.tight_layout(pad=5.0)
pt[0].set_ylim([14,30])
pt[1].set_ylim([14,30])
pt[0].plot(pows, first, color='darkblue', linewidth=1)
pt[0].scatter(pows, first, color='darkblue')
pt[1].plot(pows, second)
pt[0].set_title("LOOCV", fontsize=15)
pt[1].set_title("10-fold CV", fontsize=15)
pt[0].set_xlabel('Degree of Polynomial', fontsize=15)
pt[1].set_xlabel('Degree of Polynomial', fontsize=15)
pt[0].set_ylabel('Mean Squared Error', fontsize=15)
pt[1].set_ylabel('Mean Squared Error', fontsize=15)
plt.show()
This is fully working and you can run it on your machine to test it out. This does exactly what I want, but it seems really excessive. I am asking for advice on how to improve it using GridSearchCv
or anything else, really. I tried to pass the PolynomialFeatures
as a pipeline with LinearRegression()
, but it could not get it to change the x
on the fly. A working example would be much appreciated.
This sort of thing seems the way to do it:
pipe = Pipeline(steps=[
('poly', PolynomialFeatures(include_bias=False)),
('model', LinearRegression()),
])
search = GridSearchCV(
estimator=pipe,
param_grid={'poly__degree': list(pows)},
scoring='neg_mean_squared_error',
cv=LeaveOneOut(),
)
search.fit(df[['horsepower']], df.mpg)
first = -search.cv_results_['mean_test_score']
(negative in that last line because the scorer is negative mse)
And then plotting can proceed more or less the same way. (We're relying here on cv_results_
putting the entries in the same order as pows
; you might want to plot using appropriate columns of pd.DataFrame(search.cv_results_)
instead.)
You can use RepeatedKFold
to emulate your loop over KFold
, though then you'd only get one plot; if you really want the separate plots, then you still need the outer loop, but the grid search with cv=KFold(...)
can replace the inner loop.