the question I am trying to solve is "How can I create an automated series of code which will pull the desired column header names from a data set to fit a general linearized model (glm)?" I have a data set with 8 variables; however, I only want to use 3 of them to cross validate and find the "best" model. Here's what I've come up with:
library(boot)
salary <- read.csv("salary_data.csv")
vars <- colnames(salary[c(2,3,7)])
nvars <- length(vars)
list.to.expand = vector(mode = "list", length = nvars)
for (i in 1:nvars){
list.to.expand[[i]]=c(0,1)
}
model.spec.matrix <- expand.grid(list.to.expand)
vars
model.spec.matrix
names(model.spec.matrix) <- vars
vars.to.use <- model.spec.matrix[2,]
vars.to.use <- as.numeric(vars.to.use)
cn <- c()
for (i in 1:nrow(model.spec.matrix)){
if(i==1){cn <- colnames(model.spec.matrix[sapply(model.spec.matrix[i,], function(x) x > 0)])
}
}
print(cn)
paste(cn, collapse = "+")
glm.out = glm(paste("LogACG~",paste(cn,collapse = "+"),sep = ""), family = gaussian, data = salary)
cv.err = cv.glm(salary, glm.out, K = 10)$delta[1]
My issue lies in the for loop. I've tried to construct a loop which will append the values from "vars" into "model.use", but I cannot seem to get it to read past the 2nd row in the matrix. Any suggestions? Thanks
It seems like several things are going on here.
You've made LogACG
the explanatory variable in your 2nd-to-last line ("LogACG~
, but it's also one of the vars
which ultimately end up in model.vars
because of vars <- colnames(salary[c(2,3,7)])
, so that's not right.
Next your 2nd for
loop should be iterating through the rows of the model.spec.matrix
, i.e.
for(i in 1:nrow(model.spec.matrix)){
and to programmatically capture the column names (variables) indicated by that row you could do
cn <- colnames(model.spec.matrix[sapply(model.spec.matrix[i,], function(x) x > 0)])
within the loop. You should also move the glm
and cv.glm
inside of the loop.
However, that will overwrite the glm.out
and cv.err
each time, so you'll want to create them as empty lists and append the list in each iteration.
So the final product will look like this:
# Since you can't use LogACG to explain itself,
# suppose you meant to use Engineering as a candidate X
vars <- colnames(salary[c(2,3,8)])
# Make your grid
model.spec.matrix <- expand.grid(list.to.expand)
names(model.spec.matrix) <- vars
glm.out <- list(rep(NA, nrow(model.spec.matrix)))
cv.err <- list(rep(NA, nrow(model.spec.matrix)))
for(i in 1:nrow(model.spec.matrix)){
cn <- colnames(model.spec.matrix[sapply(model.spec.matrix[i,], function(x) x > 0)])
if(length(cn) == 0){cn <- "."}
tmp <- glm(as.formula(paste("LogACG~",paste(cn,collapse = "+"),sep = "")), family = gaussian, data = salary)
glm.out[i] <- capture.output(tmp$formula)
}
# > glm.out
# [[1]]
# [1] "LogACG ~ ."
#
# [[2]]
# [1] "LogACG ~ Rank_Code"
#
# [[3]]
# [1] "LogACG ~ Gender"
#
# [[4]]
# [1] "LogACG ~ Rank_Code + Gender"
#
# [[5]]
# [1] "LogACG ~ Engineering"
#
# [[6]]
# [1] "LogACG ~ Rank_Code + Engineering"
#
# [[7]]
# [1] "LogACG ~ Gender + Engineering"
#
# [[8]]
# [1] "LogACG ~ Rank_Code + Gender + Engineering"
To get the entire model object in the list element, replace
glm.out[i] <- capture.output(tmp$formula)
with
glm.out <- append(glm.out, tmp)
i.e.
for(i in 1:nrow(model.spec.matrix)){
cn <- colnames(model.spec.matrix[sapply(model.spec.matrix[i,], function(x) x > 0)])
if(length(cn) == 0){cn <- "."}
tmp <- glm(as.formula(paste("LogACG~",paste(cn,collapse = "+"),sep = "")), family = gaussian, data = salary)
tmp1 <- cv.glm(salary, tmp, K = 10)$delta[1]
glm.out <- append(glm.out, tmp)
cv.err <- append(cv.err, tmp1)
}
> tail(cv.err)
[[1]]
[1] 2.751025
[[2]]
[1] 2.758954
[[3]]
[1] 2.735063
[[4]]
[1] 2.768075
[[5]]
[1] 2.774433
[[6]]
[1] 2.748291
Also, you may be better served to make use of a package like bestglm
or just the step
function from the default pacakge stats
(see ?step
).