My question is: in sklearn, how is the cv_values_ attribute returned by RidgeCV calculated, and why does it differ from the output of metrics.mean_squared_error?
For example,
import numpy as np
import matplotlib.pyplot as plt

X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([1, 3.5, 4, 4.9, 6.1, 7.2, 8.1, 8.9, 10, 11.1])

fig, ax = plt.subplots()
ax.plot(X, y, 'o')
ax.plot(X, X + 1, '-')  # help visualize
Say we train the Ridge model on X and y
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
model = RidgeCV(alphas = [0.001], store_cv_values=True).fit(X, y)
Now the output of
mean_squared_error(y_true=y, y_pred=model.predict(X))
is 0.1204000013110009, while the output of
model.cv_values_.mean()
is 0.24472577167818438.
Why is there such a large difference? Am I missing something obvious?
From the official RidgeCV documentation:
cv_values_
Cross-validation values for each alpha (if store_cv_values=True and cv=None). After fit() has been called, this attribute will contain the mean squared errors (by default) or the values of the {loss,score}_func function (if provided in the constructor).
In your case, when you call
model = RidgeCV(alphas=[0.001], store_cv_values=True).fit(X, y)
you have cv=None.
cv=None means that Leave-One-Out cross-validation is used.
So cv_values_ stores the squared error for each sample under Leave-One-Out cross-validation. In every fold there is only one test point, so n = 1. Thus cv_values_ gives you the squared error for every point in your training set, measured when that point was the test fold.
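To make this concrete, here is a minimal sketch that reproduces those per-sample errors with an explicit Leave-One-Out loop instead of RidgeCV's internal shortcut: for each point, fit Ridge(alpha=0.001) on the other nine points and score the single held-out one. (The variable names here are my own; the key point is that each entry is one held-out squared error.)

```python
# Reproduce RidgeCV's cv_values_ by hand with an explicit LOO loop.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut

X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([1, 3.5, 4, 4.9, 6.1, 7.2, 8.1, 8.9, 10, 11.1])

loo_errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    # Fit on 9 points, then compute the squared error on the 1 held-out point.
    m = Ridge(alpha=0.001).fit(X[train_idx], y[train_idx])
    loo_errors.append((y[test_idx][0] - m.predict(X[test_idx])[0]) ** 2)

loo_errors = np.array(loo_errors)
print(loo_errors)         # one squared error per sample, like cv_values_
print(loo_errors.mean())  # ≈ 0.2447, the cv_values_.mean() from the question
```

Each of the ten entries is a squared error with n = 1, and their mean matches model.cv_values_.mean().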
Finally, this means that when you call model.cv_values_.mean(), you get the mean of these individual errors (the mean of each point's own squared error). To see these individual errors you can use print(model.cv_values_).
"Individual" means that n = 1 in the mean-squared-error formula:
MSE = (1/n) * sum_{i=1}^{n} (y_i - yhat_i)^2
On the other hand, mean_squared_error(y_true=y, y_pred=model.predict(X)) puts n = 10 into this equation, and its predictions come from a model fitted on all 10 points, rather than on the 9 points that exclude the one being predicted. So the two results will differ.
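As a quick check of the n = 10 case, here is a sketch (same data as above) showing that mean_squared_error is simply the sum of squared residuals divided by the number of samples, with the residuals taken from a model fitted on the full data set:

```python
# Verify that mean_squared_error divides the squared residuals by n = 10,
# where the residuals come from a model fitted on all of the data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([1, 3.5, 4, 4.9, 6.1, 7.2, 8.1, 8.9, 10, 11.1])

model = Ridge(alpha=0.001).fit(X, y)  # fit on the full data, as RidgeCV does after CV
residuals = y - model.predict(X)
manual_mse = (residuals ** 2).sum() / len(y)  # n = 10

print(manual_mse)                               # ≈ 0.1204
print(mean_squared_error(y, model.predict(X)))  # same value
```

Both numbers average over the same 10 points, but here every prediction benefits from having seen the point it predicts, which is why this MSE is smaller than the leave-one-out mean.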