I have been working through the College dataset from the ISLR package in R. I want to perform best subset selection on the training set and plot the training-set MSE associated with the best model of each size.
library(ISLR)
library(leaps)
data(College)
head(College)
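# Split the data 70/30 into training and test sets
college <- College  # R is case-sensitive: data(College) loads `College`, not `college`
set.seed(1)         # make the random split reproducible
subset <- sample(nrow(college), floor(nrow(college) * 0.7))  # random 70% of the row indices
collegetrain <- college[subset, ]
collegetest <- college[-subset, ]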
This is my code:
regfit.full <- regsubsets(apps ~ ., data = college.train, nvmax = 20)
train.mat <- model.matrix(apps ~ ., data = college.train, nvmax = 20)
val.errors <- rep(NA, 20)
for (i in 1:20) {
coefi <- coef(regfit.full, id = i)
pred <- train.mat[, names(coefi)] %*% coefi
val.errors[i] <- mean((pred - college.train$y)^2)
}
plot(val.errors, xlab = "Number of predictors", ylab = "Training MSE", pch = 19, type = "b")
The dataset is structured like this: 777 observations, with 543 in the training set and 234 in the test set. There are 18 variables: 17 are numeric and 1 is a factor with levels Yes and No (this doesn't need to be changed).
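With 18 variables and one of them used as the response, at most 17 predictors are available, so no model larger than 17 variables can exist. A minimal sketch of how to check this count from the design matrix, using a hypothetical toy data frame (`toy`, `y`, `x1`, `x2` are made-up names that only mimic the numeric-plus-Yes/No structure described above):

```r
# Hypothetical toy data: a response, two numeric predictors, one Yes/No factor
set.seed(1)
toy <- data.frame(
  y       = rnorm(10),
  x1      = rnorm(10),
  x2      = rnorm(10),
  Private = factor(rep(c("Yes", "No"), 5))
)

# model.matrix() expands the two-level factor into a single dummy column
X <- model.matrix(y ~ ., data = toy)
colnames(X)

# Number of predictor columns, excluding the intercept
n_pred <- ncol(X) - 1
n_pred
```

Running the same two lines with `Apps ~ .` on the training data should give the largest model size that can be requested.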
The error message I get when I run my code is: Error in s$which[id, , drop = FALSE] : subscript out of bounds
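This error is consistent with asking `coef()` for a model size that was never fitted. A small sketch of that behaviour, assuming the built-in `mtcars` data as a stand-in (10 predictors for `mpg`); the names `fit`, `n_models` and `too_big` are only illustrative:

```r
library(leaps)

# nvmax = 20 is larger than the 10 available predictors,
# so only 10 models (sizes 1 to 10) are stored.
fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = 20)
n_models <- nrow(summary(fit)$which)
n_models

# Requesting a size beyond n_models raises an error
# rather than returning coefficients.
too_big <- tryCatch(coef(fit, id = n_models + 1),
                    error = function(e) "error")
too_big
```

Reading the loop bound from `summary(fit)$which` instead of hard-coding it avoids running past the last fitted model.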
# College has 17 predictors, so the best-subset models have sizes 1 to 17
regfit.full <- regsubsets(Apps ~ ., data = collegetrain, nvmax = 17)
train.mat <- model.matrix(Apps ~ ., data = collegetrain)
val.errors <- rep(NA, 17)
for (i in 1:17) {
  coefi <- coef(regfit.full, id = i)                   # coefficients of the best i-variable model
  pred <- train.mat[, names(coefi)] %*% coefi          # fitted values on the training set
  val.errors[i] <- mean((pred - collegetrain$Apps)^2)  # training MSE for size i
}
plot(val.errors, xlab = "Number of predictors", ylab = "Training MSE",
     pch = 19, type = "b")
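As a sanity check on a loop like the one above, the training MSE can also be read from the residual sums of squares that `summary()` reports for a `regsubsets` fit. The sketch below again assumes the built-in `mtcars` data rather than College, and `fit`, `fit.summary`, `X` and `train.mse` are illustrative names:

```r
library(leaps)

fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = 20)
fit.summary <- summary(fit)
n_models <- nrow(fit.summary$which)  # number of model sizes actually fitted

# Manual training MSE for the best model of each size
X <- model.matrix(mpg ~ ., data = mtcars)
train.mse <- rep(NA, n_models)
for (i in seq_len(n_models)) {
  coefi <- coef(fit, id = i)
  pred <- X[, names(coefi)] %*% coefi
  train.mse[i] <- mean((pred - mtcars$mpg)^2)
}

# Shortcut: RSS of each best model divided by the number of observations
shortcut.mse <- fit.summary$rss / nrow(mtcars)
all.equal(train.mse, shortcut.mse)
```

The two vectors should agree up to numerical tolerance, and the training MSE should not increase as predictors are added.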