
Scikit learn: RidgeCV seems not to give the best option?


This is my X:

import numpy as np

 X =  np.array([[  5.,   8.,   3.,   4.,   0.,   5.,   4.,   0.,   2.,   5.,  11.,
              3.,  19.,   2.],
           [  5.,   8.,   3.,   4.,   0.,   1.,   4.,   0.,   3.,   5.,  13.,
              4.,  19.,   2.],
           [  5.,   8.,   3.,   4.,   0.,   4.,   4.,   0.,   3.,   5.,  12.,
              2.,  19.,   2.],
           [  5.,   8.,   3.,   4.,   0.,   1.,   4.,   0.,   4.,   5.,  12.,
              4.,  19.,   2.],
           [  5.,   8.,   3.,   4.,   0.,   1.,   4.,   0.,   3.,   5.,  12.,
              5.,  19.,   2.],
           [  5.,   8.,   3.,   4.,   0.,   2.,   4.,   0.,   3.,   5.,  13.,
              3.,  19.,   2.],
           [  5.,   8.,   3.,   4.,   0.,   2.,   4.,   0.,   4.,   5.,  11.,
              4.,  19.,   2.],
           [  5.,   8.,   3.,   4.,   0.,   2.,   4.,   0.,   3.,   5.,  11.,
              5.,  19.,   2.],
           [  5.,   8.,   3.,   4.,   0.,   1.,   4.,   0.,   3.,   5.,  12.,
              5.,  19.,   2.],
           [  5.,   8.,   3.,   4.,   0.,   1.,   4.,   0.,   3.,   5.,  12.,
              5.,  19.,   2.]])

and this is my response y

y = np.array([ 70.14963195,  70.20937046,  70.20890363,  70.14310389,
        70.18076206,  70.13179977,  70.13536797,  70.10700998,
        70.09194074,  70.09958111])

Ridge Regression

    from sklearn.linear_model import Ridge

    # alpha = 0.1
    model = Ridge(alpha = 0.1)
    model.fit(X,y)
    model.score(X,y)   # gives 0.36898424479816627

    # alpha = 0.01
    model1 = Ridge(alpha = 0.01)
    model1.fit(X,y)
    model1.score(X,y)     # gives 0.3690347045143918 > 0.36898424479816627

    # alpha = 0.001
    model2 = Ridge(alpha = 0.001)
    model2.fit(X,y)
    model2.score(X,y)  #gives 0.36903522192901728 > 0.3690347045143918

    # alpha = 0.0001
    model3 = Ridge(alpha = 0.0001)
    model3.fit(X,y)
    model3.score(X,y)  # gives 0.36903522711624259 > 0.36903522192901728

Thus from here it should be clear that alpha = 0.0001 is the best option. Indeed, the documentation says that the score is the coefficient of determination, and the coefficient closest to 1 indicates the best model. Now let's see what RidgeCV tells us.

RidgeCV regression

from sklearn.linear_model import RidgeCV

modelCV = RidgeCV(alphas = [0.1, 0.01, 0.001, 0.0001], store_cv_values = True)
modelCV.fit(X,y)
modelCV.alpha_      # gives 0.1
modelCV.score(X,y)  # gives 0.36898424479812919, the same score as Ridge regression with alpha = 0.1

What is going wrong? Surely we can check manually, as I have done above, that all the other alphas score better. So not only is it not choosing the best alpha, it is choosing the worst!

Can someone explain to me what is going wrong?


Solution

  • That's perfectly normal behaviour.

    Your manual approach is not doing any cross-validation, so the training data and the test data are the same!

    # alpha = 0.1
    model = Ridge(alpha = 0.1)
    model.fit(X,y)   # fit on ALL of the data ...
    model.score(X,y) # ... and score on the SAME data -> a training score, not a validation score
    
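    To make the gap visible, one could hold out a few rows before scoring. This is only a minimal sketch: it assumes the X and y from the question are in scope, and with just 10 samples the split is tiny, so the exact numbers are purely illustrative.

        from sklearn.linear_model import Ridge
        from sklearn.model_selection import train_test_split

        # keep 3 of the 10 samples aside purely for scoring
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, random_state=0)

        model = Ridge(alpha=0.0001)
        model.fit(X_train, y_train)
        print(model.score(X_train, y_train))  # training R^2: flatters the barely-regularized fit
        print(model.score(X_test, y_test))    # held-out R^2: usually much lower, can even be negative
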

    Because your manual approach fits and scores on the same data, and given some mild assumptions about the model (e.g. a convex optimization problem) and the solver (guaranteed epsilon-convergence), you will always get the best training score (i.e. the lowest training error) for the least-regularized model, since it is free to overfit: in your case, alpha = 0.0001. (Have a look at Ridge regression's objective, shown below.)
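
    For reference, this is the objective that Ridge minimizes (written here in LaTeX; it is the formula given in the scikit-learn documentation):

        \min_w \; \lVert y - Xw \rVert_2^2 + \alpha \, \lVert w \rVert_2^2

    The smaller alpha is, the weaker the penalty on w, so the solution is free to track the training data more closely.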

    With RidgeCV, though, cross-validation is activated by default, with leave-one-out being used. The scoring process that determines the best parameter therefore does not use the same data for training and testing (a manual reproduction is sketched at the end of this answer).

    You can print out the mean cv_values_ as you are using store_cv_values = True:

    print(np.mean(modelCV.cv_values_, axis=0))
    # [ 0.00226582  0.0022879   0.00229021  0.00229044]
    # alpha [0.1, 0.01, 0.001,0.0001]
    # by default: mean squared errors!
    # left / 0.1 best; right / 0.0001 worst 
    # this is only a demo: not sure how sklearn selects best (mean vs. ?)
    

    This outcome is expected here, but it is not a general rule. Because you are now training and scoring on different data, you are optimizing against overfitting, and with high probability some regularization is needed!
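
    As a sanity check, the leave-one-out comparison can be reproduced by hand. This is a minimal sketch assuming the X and y from the question are in scope; it runs an explicit LeaveOneOut loop via cross_val_score, so the values may differ marginally from RidgeCV's closed-form leave-one-out, but the ranking of the alphas should be the same:

        from sklearn.linear_model import Ridge
        from sklearn.model_selection import LeaveOneOut, cross_val_score

        # mean squared error of each alpha under explicit leave-one-out CV
        for alpha in [0.1, 0.01, 0.001, 0.0001]:
            mse = -cross_val_score(Ridge(alpha=alpha), X, y,
                                   cv=LeaveOneOut(),
                                   scoring='neg_mean_squared_error').mean()
            print(alpha, mse)

        # the smallest error should correspond to alpha = 0.1, matching modelCV.alpha_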