r statistics data-modeling modeling cross-validation

Cross-validation MSE vs. training MSE in lasso regression

I am currently using the lasso for feature selection. First I perform a 10-fold cross-validation to find the shrinkage parameter (lambda) with the lowest MSE. I then try to calculate the MSE on the training set myself, but this value does not match the one shown in the cross-validation plot.

library(glmnet)

cv <- cv.glmnet(as.matrix(mtcars[,c(1,3:9)]), mtcars[,c(2)], alpha=1, nfolds=10, type.measure="mse")
plot(cv)

# refit on the full data at the lambda chosen by cross-validation
lasso.mod <- glmnet(as.matrix(mtcars[,c(1,3:9)]), mtcars[,c(2)], alpha=1, lambda=cv$lambda.min)
y <- predict(lasso.mod, s=cv$lambda.min, newx=as.matrix(mtcars[,c(1,3:9)]))
mean((mtcars[,c(2)]-y)^2) # MSE on the training data

What is the difference between the calculation above and the expression below? I was told that the expression below gives the MSE of the lasso, so why are the two values not identical? To be precise, I use the same dataset for the cross-validation as for the calculation of the MSE.

cv$cvm[cv$lambda == cv$lambda.min]

Solution

The cross-validation MSE should not be expected to equal the MSE on the whole training set, because they are two different concepts.

For a given lambda, the cross-validation MSE is computed as follows: split the training set into 10 parts; for each part, fit the lasso with that lambda on the other 9 parts and calculate the MSE on the held-out part; then average the 10 MSEs you get. Every prediction is therefore made for observations the model was not fitted on. The training MSE, in contrast, evaluates the model on the same data it was fitted on, so it is usually lower than the cross-validation MSE.
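
The procedure described above can be sketched by hand and compared with what cv.glmnet reports. This is only an illustration under some assumptions: the fold assignment is fixed through the foldid argument so that both computations use the same splits, the seed is arbitrary, and the names foldid, fold_mse, fold_n, cv_mse_manual, cv_mse_glmnet and train_mse are made up for this example. The per-fold MSEs are averaged with the fold sizes as weights, since the folds are not all the same size.

```r
library(glmnet)

x <- as.matrix(mtcars[, c(1, 3:9)])
y <- mtcars[, 2]

# Fix the fold assignment so the manual loop and cv.glmnet use the same splits
set.seed(1)  # arbitrary seed, only for reproducibility
foldid <- sample(rep(seq_len(10), length.out = nrow(x)))
cv <- cv.glmnet(x, y, alpha = 1, foldid = foldid, type.measure = "mse")

fold_mse <- numeric(10)
fold_n <- numeric(10)
for (k in seq_len(10)) {
  held_out <- foldid == k
  # Fit on the other 9 parts, using the same lambda sequence as the CV run
  fit_k <- glmnet(x[!held_out, ], y[!held_out], alpha = 1, lambda = cv$lambda)
  # Predict only for the held-out part at the selected lambda
  pred_k <- predict(fit_k, newx = x[held_out, , drop = FALSE], s = cv$lambda.min)
  fold_mse[k] <- mean((y[held_out] - pred_k)^2)
  fold_n[k] <- sum(held_out)
}

# Average of the per-fold MSEs, weighted by fold size
cv_mse_manual <- weighted.mean(fold_mse, w = fold_n)
cv_mse_glmnet <- cv$cvm[cv$lambda == cv$lambda.min]

# Training MSE: predictions for the same rows the model was fitted on
train_mse <- mean((y - predict(cv, newx = x, s = "lambda.min"))^2)

print(c(manual_cv = cv_mse_manual, glmnet_cv = cv_mse_glmnet, training = train_mse))
```

The first two numbers should agree up to numerical tolerance, while the third is a different quantity and will typically be smaller, which is the discrepancy seen in the question.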