I'm trying to perform a ridge regression, so I started by splitting the data into two subsamples. I would like to mention that my data is imputed, so there are no NA values in it (I double-checked with the vis_miss function):
# Sample data from my current data set
loc <- sample(1:nrow(main_df_nipals_imputed_1),
              round(nrow(main_df_nipals_imputed_1) * 0.9))
# Divide data set into training and testing set
# Training
read_training <- main_df_nipals_imputed_1[loc, ]
# Test
read_test <- main_df_nipals_imputed_1[-loc, ]
After this, I apply ridge regression:
ridge <- cv.glmnet(x = read_training,
                   y = read_test,
                   type.measure = "mse",
                   alpha = 0,
                   family = "gaussian",
                   nlambda = 200)
This gave me the following error:
Error in glmnet(read_training, read_test) :
number of observations in y (10) not equal to the number of rows of x (94)
I think that the error is pretty intuitive, so I decided to check the dimensions of my data:
> dim(read_test)
[1] 10 84
> dim(read_training)
[1] 94 84
I think that at this point the output should be (though maybe I'm wrong):
> dim(read_test)
[1] 10 84
> dim(read_training)
[1] 84 94
So far, the only explanation I have found for this is that I have missing values, which is not the case.
So I checked whether the training and test sets must be the same size; however, I don't think that is the case.
Before sharing my data, I would like to focus this post on what is incorrect in my data preparation for the ridge regression.
Answer:
You misunderstand what the x and y parameters represent. Per the documentation, x is a matrix of predictor features from a data set, and y contains the corresponding prediction targets from the same data set. The number of observations in x and y must therefore be the same. The function then performs k-fold cross-validation with that data set to tune the regularization penalty, lambda.
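For example, here is a minimal sketch with synthetic data (the sizes are arbitrary) showing that contract:

library(glmnet)

set.seed(42)
x <- matrix(rnorm(100 * 5), nrow = 100)  # 100 observations, 5 predictor features
y <- rnorm(100)                          # 100 targets, one per row of x
fit <- cv.glmnet(x, y, alpha = 0)        # ridge; k-fold CV over a grid of lambdas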
As you've done, you'd typically split the whole data set into training and test partitions. You'd supply the training partition to the cv.glmnet function to perform hyperparameter tuning. Once you have the best hyperparameters, you'd retrain the model on the full training set using them (the object returned by cv.glmnet already contains such a fit). You would then use the test partition to assess that model's ability to generalize to unseen data.
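Here is a sketch of that workflow. I'm assuming, hypothetically, that main_df_nipals_imputed_1 is a data frame and that your prediction target is a column named outcome; substitute your actual target column:

library(glmnet)

target <- "outcome"  # hypothetical name; replace with your real target column

# Separate the predictors (x) from the target (y) in each partition
x_train <- as.matrix(read_training[, setdiff(colnames(read_training), target)])
y_train <- read_training[[target]]
x_test  <- as.matrix(read_test[, setdiff(colnames(read_test), target)])
y_test  <- read_test[[target]]

# Tune lambda by k-fold cross-validation on the training partition only
ridge <- cv.glmnet(x = x_train, y = y_train,
                   type.measure = "mse", alpha = 0,
                   family = "gaussian", nlambda = 200)

# Assess generalization on the held-out test partition
pred <- predict(ridge, newx = x_test, s = "lambda.min")
mean((y_test - pred)^2)  # test-set MSE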
To answer your first question more directly, the argument to x should be the subset of columns in read_training that are used as predictive features in your model, converted to a matrix (glmnet does not accept data frames). The argument to y should be the prediction target column.
To answer your second question, no, the training and test datasets do not need to have the same number of observations, but they should have the same set of features.
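Reusing the hypothetical names from the sketch above, a quick sanity check of both points:

stopifnot(identical(colnames(x_train), colnames(x_test)))  # same feature set
c(nrow(x_train), nrow(x_test))  # row counts may differ, e.g. 94 and 10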