Search code examples
pythonpython-3.xscikit-learngridsearchcvk-fold

How to use GridSearchCV for polynomials of different degrees?


What I wanted to do is go over some OLS fits with different degrees of polynomials to see which degree performs better at predicting mpg given horsepower (while using LOOCV and KFold). I wrote the code, but I couldn't figure out how to apply the PolynomialFeatures function to every iteration using GridSearchCv, so I ended up writing this:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import LeaveOneOut, KFold
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error



df = pd.read_csv('http://web.stanford.edu/~oleg2/hse/auto/Auto.csv')[['horsepower','mpg']].dropna()

pows = range(1,11)
first, second, mse = [], [], 0     # 'first' is data for the first plot and 'second' is for the second one

for p in pows:
  mse = 0
  for train_index, test_index in LeaveOneOut().split(df):
      x_train, x_test = df.horsepower.iloc[train_index], df.horsepower.iloc[test_index]
      y_train, y_test = df.mpg.iloc[train_index], df.mpg.iloc[test_index]
      polynomial_features = PolynomialFeatures(degree = p)
      x = polynomial_features.fit_transform(x_train.values.reshape(-1,1))   #getting the polynomial
      ft = LinearRegression().fit(x,y_train)
      x1 = polynomial_features.fit_transform(x_test.values.reshape(-1,1))   #getting the polynomial
      mse += mean_squared_error(y_test, ft.predict(x1))
  first.append(mse/len(df))
    
for p in pows: 
    temp = []   
    for i in range(9):      # this is to plot a few graphs for comparison
        mse = 0
        for train_index, test_index in KFold(10, True).split(df):
            x_train, x_test = df.horsepower.iloc[train_index], df.horsepower.iloc[test_index]
            y_train, y_test = df.mpg.iloc[train_index], df.mpg.iloc[test_index]
            polynomial_features = PolynomialFeatures(degree = p)
            x = polynomial_features.fit_transform(x_train.values.reshape(-1,1))   #getting the polynomial
            ft = LinearRegression().fit(x,y_train)
            x1 = polynomial_features.fit_transform(x_test.values.reshape(-1,1))   #getting the polynomial
            mse += mean_squared_error(y_test, ft.predict(x1))
        temp.append(mse/10)
    second.append(temp)      


f, pt = plt.subplots(1,2,figsize=(12,5.1))
f.tight_layout(pad=5.0)
pt[0].set_ylim([14,30])
pt[1].set_ylim([14,30])
pt[0].plot(pows, first, color='darkblue', linewidth=1)
pt[0].scatter(pows, first, color='darkblue')
pt[1].plot(pows, second)
pt[0].set_title("LOOCV", fontsize=15)
pt[1].set_title("10-fold CV", fontsize=15)
pt[0].set_xlabel('Degree of Polynomial', fontsize=15)
pt[1].set_xlabel('Degree of Polynomial', fontsize=15)
pt[0].set_ylabel('Mean Squared Error', fontsize=15)
pt[1].set_ylabel('Mean Squared Error', fontsize=15)
plt.show()

It produces: graph

This is fully working and you can run it on your machine to test it out. This does exactly what I want, but it seems really excessive. I am asking for advice on how to improve it using GridSearchCv or anything else, really. I tried to pass the PolynomialFeatures as a pipeline with LinearRegression(), but it could not get it to change the x on the fly. A working example would be much appreciated.


Solution

  • This sort of thing seems the way to do it:

    pipe = Pipeline(steps=[
        ('poly', PolynomialFeatures(include_bias=False)),
        ('model', LinearRegression()),
    ])
    
    search = GridSearchCV(
        estimator=pipe,
        param_grid={'poly__degree': list(pows)},
        scoring='neg_mean_squared_error',
        cv=LeaveOneOut(),
    )
    
    search.fit(df[['horsepower']], df.mpg)
    
    first = -search.cv_results_['mean_test_score']
    

    (negative in that last line because the scorer is negative mse)

    And then plotting can proceed more or less the same way. (We're relying here on cv_results_ putting the entries in the same order as pows; you might want to plot using appropriate columns of pd.DataFrame(search.cv_results_) instead.)

    You can use RepeatedKFold to emulate your loop over KFold, though then you'd only get one plot; if you really want the separate plots, then you still need the outer loop, but the grid search with cv=KFold(...) can replace the inner loop.