This is my X:
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV

X = np.array([[ 5., 8., 3., 4., 0., 5., 4., 0., 2., 5., 11.,
3., 19., 2.],
[ 5., 8., 3., 4., 0., 1., 4., 0., 3., 5., 13.,
4., 19., 2.],
[ 5., 8., 3., 4., 0., 4., 4., 0., 3., 5., 12.,
2., 19., 2.],
[ 5., 8., 3., 4., 0., 1., 4., 0., 4., 5., 12.,
4., 19., 2.],
[ 5., 8., 3., 4., 0., 1., 4., 0., 3., 5., 12.,
5., 19., 2.],
[ 5., 8., 3., 4., 0., 2., 4., 0., 3., 5., 13.,
3., 19., 2.],
[ 5., 8., 3., 4., 0., 2., 4., 0., 4., 5., 11.,
4., 19., 2.],
[ 5., 8., 3., 4., 0., 2., 4., 0., 3., 5., 11.,
5., 19., 2.],
[ 5., 8., 3., 4., 0., 1., 4., 0., 3., 5., 12.,
5., 19., 2.],
[ 5., 8., 3., 4., 0., 1., 4., 0., 3., 5., 12.,
5., 19., 2.]])
And this is my response y:
y = np.array([ 70.14963195, 70.20937046, 70.20890363, 70.14310389,
70.18076206, 70.13179977, 70.13536797, 70.10700998,
70.09194074, 70.09958111])
Ridge Regression
# alpha = 0.1
model = Ridge(alpha = 0.1)
model.fit(X,y)
model.score(X,y) # gives 0.36898424479816627
# alpha = 0.01
model1 = Ridge(alpha = 0.01)
model1.fit(X,y)
model1.score(X,y) # gives 0.3690347045143918 > 0.36898424479816627
# alpha = 0.001
model2 = Ridge(alpha = 0.001)
model2.fit(X,y)
model2.score(X,y) #gives 0.36903522192901728 > 0.3690347045143918
# alpha = 0.0001
model3 = Ridge(alpha = 0.0001)
model3.fit(X,y)
model3.score(X,y) # gives 0.36903522711624259 > 0.36903522192901728
From this it should be clear that alpha = 0.0001 is the best option. Indeed, the documentation says that the score is the coefficient of determination, and the model whose coefficient is closest to 1 is the best one. Now let's see what RidgeCV tells us.
RidgeCV regression
modelCV = RidgeCV(alphas = [0.1, 0.01, 0.001,0.0001], store_cv_values = True)
modelCV.fit(X,y)
modelCV.alpha_ #giving 0.1
modelCV.score(X,y) # giving 0.36898424479812919 which is the same score as ridge regression with alpha = 0.1
What is going wrong? Surely we can check manually, as I have done, that all the other alphas are better. So not only is it not choosing the best alpha, it is choosing the worst!
Can someone explain to me what is going wrong?
That's perfectly normal behaviour.
Your manual approach is not doing any cross-validation, so the training data and the test data are the same!
# alpha = 0.1
model = Ridge(alpha = 0.1)
model.fit(X,y) # fitted on X, y
model.score(X,y) # scored on the very same X, y!
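With some mild assumptions on the model (e.g. a convex optimization problem) and the solver (guaranteed epsilon-convergence), this means that the least regularized model will always get the best score on the training data, i.e. the lowest training error (overfitting!). In your case that is alpha = 0.0001. (Have a look at the ridge regression objective: the penalty alpha * ||w||^2 is added to the squared error, so a smaller alpha leaves the fit freer to follow the training data.)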
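To make the point about training scores concrete, here is a minimal NumPy sketch that fits ridge regression in closed form on the data from the question and scores it on the same data. The helpers `fit_ridge` and `r2` are hypothetical names written for this illustration, not scikit-learn functions, and the sketch assumes an unpenalized intercept handled by centering.

```python
import numpy as np

# Data copied from the question.
X = np.array([
    [5., 8., 3., 4., 0., 5., 4., 0., 2., 5., 11., 3., 19., 2.],
    [5., 8., 3., 4., 0., 1., 4., 0., 3., 5., 13., 4., 19., 2.],
    [5., 8., 3., 4., 0., 4., 4., 0., 3., 5., 12., 2., 19., 2.],
    [5., 8., 3., 4., 0., 1., 4., 0., 4., 5., 12., 4., 19., 2.],
    [5., 8., 3., 4., 0., 1., 4., 0., 3., 5., 12., 5., 19., 2.],
    [5., 8., 3., 4., 0., 2., 4., 0., 3., 5., 13., 3., 19., 2.],
    [5., 8., 3., 4., 0., 2., 4., 0., 4., 5., 11., 4., 19., 2.],
    [5., 8., 3., 4., 0., 2., 4., 0., 3., 5., 11., 5., 19., 2.],
    [5., 8., 3., 4., 0., 1., 4., 0., 3., 5., 12., 5., 19., 2.],
    [5., 8., 3., 4., 0., 1., 4., 0., 3., 5., 12., 5., 19., 2.],
])
y = np.array([70.14963195, 70.20937046, 70.20890363, 70.14310389,
              70.18076206, 70.13179977, 70.13536797, 70.10700998,
              70.09194074, 70.09958111])


def fit_ridge(X_train, y_train, alpha):
    """Closed-form ridge fit with an unpenalized intercept (illustrative helper)."""
    x_mean = X_train.mean(axis=0)
    y_mean = y_train.mean()
    Xc = X_train - x_mean          # centering takes care of the intercept
    yc = y_train - y_mean
    n_features = X_train.shape[1]
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept


def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


alphas = [0.1, 0.01, 0.001, 0.0001]
train_scores = []
for alpha in alphas:
    coef, intercept = fit_ridge(X, y, alpha)
    # Scored on the same data the model was fitted on.
    train_scores.append(r2(y, X @ coef + intercept))
    print(alpha, train_scores[-1])
```

Because the data used for scoring is the data used for fitting, the printed score can only stay the same or grow as alpha shrinks. It says nothing about how the model behaves on unseen samples.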
With RidgeCV, though, cross-validation is enabled by default, and leave-one-out is the scheme used. The scoring process that determines the best alpha therefore does not use the same data for training and testing.
You can print the mean of cv_values_, since you are using store_cv_values = True (scikit-learn 1.5 renamed these to cv_results_ and store_cv_results):
print(np.mean(modelCV.cv_values_, axis=0))
# [ 0.00226582 0.0022879 0.00229021 0.00229044]
# alphas: [0.1, 0.01, 0.001, 0.0001]
# by default these are mean squared errors, so lower is better
# leftmost (alpha = 0.1) is best; rightmost (alpha = 0.0001) is worst
# this is only a demo: not sure how sklearn selects the best (mean vs. ?)
This outcome is expected here, but it is not a general rule. Because you are now scoring on data that was not used for fitting, you are optimizing against overfitting, and with high probability some regularization is needed!
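The leave-one-out idea described above can also be sketched by hand: hold out one sample, fit on the remaining nine, and measure the squared error on the held-out sample. This is an illustrative NumPy sketch, not scikit-learn's internal implementation (RidgeCV uses an efficient leave-one-out scheme), so the numbers need not match cv_values_ digit for digit. The helpers `fit_ridge` and `loo_mse` are hypothetical names introduced for this example.

```python
import numpy as np

# Data copied from the question.
X = np.array([
    [5., 8., 3., 4., 0., 5., 4., 0., 2., 5., 11., 3., 19., 2.],
    [5., 8., 3., 4., 0., 1., 4., 0., 3., 5., 13., 4., 19., 2.],
    [5., 8., 3., 4., 0., 4., 4., 0., 3., 5., 12., 2., 19., 2.],
    [5., 8., 3., 4., 0., 1., 4., 0., 4., 5., 12., 4., 19., 2.],
    [5., 8., 3., 4., 0., 1., 4., 0., 3., 5., 12., 5., 19., 2.],
    [5., 8., 3., 4., 0., 2., 4., 0., 3., 5., 13., 3., 19., 2.],
    [5., 8., 3., 4., 0., 2., 4., 0., 4., 5., 11., 4., 19., 2.],
    [5., 8., 3., 4., 0., 2., 4., 0., 3., 5., 11., 5., 19., 2.],
    [5., 8., 3., 4., 0., 1., 4., 0., 3., 5., 12., 5., 19., 2.],
    [5., 8., 3., 4., 0., 1., 4., 0., 3., 5., 12., 5., 19., 2.],
])
y = np.array([70.14963195, 70.20937046, 70.20890363, 70.14310389,
              70.18076206, 70.13179977, 70.13536797, 70.10700998,
              70.09194074, 70.09958111])


def fit_ridge(X_train, y_train, alpha):
    """Closed-form ridge fit with an unpenalized intercept (illustrative helper)."""
    x_mean = X_train.mean(axis=0)
    y_mean = y_train.mean()
    Xc = X_train - x_mean
    yc = y_train - y_mean
    n_features = X_train.shape[1]
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept


def loo_mse(X, y, alpha):
    """Mean squared error over all leave-one-out splits (illustrative helper)."""
    n_samples = X.shape[0]
    squared_errors = []
    for i in range(n_samples):
        train = np.arange(n_samples) != i      # every sample except i
        coef, intercept = fit_ridge(X[train], y[train], alpha)
        prediction = X[i] @ coef + intercept   # predict the held-out sample
        squared_errors.append((y[i] - prediction) ** 2)
    return float(np.mean(squared_errors))


alphas = [0.1, 0.01, 0.001, 0.0001]
loo = {alpha: loo_mse(X, y, alpha) for alpha in alphas}
for alpha, mse in loo.items():
    print(alpha, mse)

# The alpha with the lowest held-out error is the one cross-validation prefers.
best_alpha = min(loo, key=loo.get)
print("best alpha by leave-one-out:", best_alpha)
```

Here each model is judged only on a sample it never saw during fitting, which is why the ranking of alphas can differ from the ranking by training score.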