Tags: python, scikit-learn, linear-regression, grid-search, lasso-regression

Why does GridSearchCV return a score so different from the score returned by running the model directly?


I used GridSearchCV to find the best alpha for a Lasso model.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

alphas = np.logspace(-5, 2, 30)
grid = GridSearchCV(estimator=Lasso(),
                    param_grid=dict(alpha=alphas), cv=10, scoring='r2')
grid.fit(self.X, self.Y)  # the entire dataset is fed here

print(grid.best_params_, grid.best_score_)  # score -0.0470788758558
for mean_score, params in zip(grid.cv_results_['mean_test_score'],
                              grid.cv_results_['params']):
    print(mean_score, params)

I got the best parameter as alpha = 0.0014873521072935117, with a negative R² score of -0.0470788758558.


Then I tried this alpha on the model directly. I ran the following code in a loop.

from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(self.X, self.Y, train_size=0.7)
lasso = Lasso(alpha=0.001487)
lasso.fit(X_train, y_train)
print(lasso.score(X_test, y_test))

Notice that I didn't set the random state, so each run uses a different random split, which should behave like cross-validation. But the score I got here was around 0.11 (0.11-0.12) no matter how many times I ran the code.
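
For reference, here is a minimal sketch of the repeated evaluation described above, assuming the data are available as plain arrays X and Y (the repeat count and the averaging are illustrative, not from the original post):

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

scores = []
for _ in range(10):  # repeat the random 70/30 split
    X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=0.7)
    lasso = Lasso(alpha=0.001487)
    lasso.fit(X_train, y_train)
    scores.append(lasso.score(X_test, y_test))  # R^2 on the held-out 30%

print(np.mean(scores), np.std(scores))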


Question

Why are the scores -0.0470788758558 and 0.11 so different for the two approaches?


Solution

  • I found the reason.

    cv should be set like this:

    from sklearn.model_selection import ShuffleSplit

    cv = ShuffleSplit(n_splits=10, test_size=0.3)  # 10 shuffled iterations, 30% test set
    

    When cv is an integer, it specifies the number of folds of a plain (unshuffled) K-fold split, not the number of shuffled train/test iterations. With cv=10 the grid search scores each candidate on 10% validation folds without shuffling, while the manual loop scored on a shuffled 30% test set, so the two numbers are not directly comparable. Passing the ShuffleSplit object as cv makes the cross-validation splits match the manual evaluation (see the sketch below).
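
    A minimal sketch of wiring this up, assuming the data are available as plain arrays X and Y (these names are assumptions; the point is passing the ShuffleSplit object as cv):

    import numpy as np
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import GridSearchCV, ShuffleSplit

    alphas = np.logspace(-5, 2, 30)
    # 10 shuffled 70/30 splits, mirroring the manual train_test_split loop
    cv = ShuffleSplit(n_splits=10, test_size=0.3)
    grid = GridSearchCV(estimator=Lasso(), param_grid=dict(alpha=alphas),
                        cv=cv, scoring='r2')
    grid.fit(X, Y)
    print(grid.best_params_, grid.best_score_)  # now comparable to the manual score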