I'm using the glmnet package in R for ridge regression, and I tried it on the Hitters dataset from the ISLR package. The problem is that when I use model.matrix to create the design matrix, the number of observations is reduced for no apparent reason. Here is the code:
library(ISLR)
library(glmnet)
data("Hitters")
set.seed(1)
train=sample(1:nrow(Hitters), nrow(Hitters)/2)
test=(-train)
train.data = Hitters[train,]
test.data = Hitters[test,]
train.x=model.matrix(Salary~.,train.data)[,-1]
train.y=train.data$Salary
In the code, I'm trying to predict the Salary variable using all the other variables. train.data has 161 observations, while train.x has only 131. I don't understand why this happens and would appreciate any help.
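You have NA values in the Salary field. model.matrix builds a model frame first, and under the default na.action (na.omit) every row with a missing value in a variable used by the formula is dropped.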
You can identify the problem like this:
missing.players <- setdiff(rownames(train.data), rownames(train.x))
train.data[missing.players, ]
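One possible fix is to drop the incomplete rows before building the design matrix, so that the predictors and the response keep the same length. Note that in the question's code train.y still contains the NA entries, so it no longer lines up with train.x. The sketch below uses a small made-up data frame (toy, toy.complete, x.raw, x and y are hypothetical names, not part of Hitters) and assumes R's default na.action setting:

```r
# Toy data (hypothetical), standing in for Hitters: the response y has two NAs
toy <- data.frame(y  = c(10, NA, 30, NA, 50),
                  x1 = c(1, 2, 3, 4, 5),
                  x2 = factor(c("a", "b", "b", "a", "a")))

# model.matrix() builds a model frame first; with the default
# na.action (na.omit), rows containing NA are silently dropped
x.raw <- model.matrix(y ~ ., toy)[, -1]
nrow(toy)     # 5
nrow(x.raw)   # 3

# Remove incomplete rows up front so predictors and response stay aligned
toy.complete <- na.omit(toy)
x <- model.matrix(y ~ ., toy.complete)[, -1]
y <- toy.complete$y
nrow(x) == length(y)   # TRUE
```

Applied to the question, the same idea would be to call na.omit(Hitters) before sampling the training indices, so that train.x and train.y are built from the same rows.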