Search code examples
rfor-loopif-statementcross-validation

Correspond Indicator Matrix to Column Names in R


the question I am trying to solve is "How can I create an automated series of code which will pull the desired column header names from a data set to fit a general linearized model (glm)?" I have a data set with 8 variables; however, I only want to use 3 of them to cross validate and find the "best" model. Here's what I've come up with:

library(boot)
salary <- read.csv("salary_data.csv")
vars <- colnames(salary[c(2,3,7)])
nvars <- length(vars)
list.to.expand = vector(mode = "list", length = nvars)
for (i in 1:nvars){
  list.to.expand[[i]]=c(0,1)
}
model.spec.matrix <- expand.grid(list.to.expand)
vars
model.spec.matrix
names(model.spec.matrix) <- vars
vars.to.use <- model.spec.matrix[2,]
vars.to.use <- as.numeric(vars.to.use)
cn <- c()
for (i in 1:nrow(model.spec.matrix)){
  if(i==1){cn <- colnames(model.spec.matrix[sapply(model.spec.matrix[i,], function(x) x > 0)]) 
  }
}
print(cn)
paste(cn, collapse = "+")
glm.out = glm(paste("LogACG~",paste(cn,collapse = "+"),sep = ""), family = gaussian, data = salary)
cv.err = cv.glm(salary, glm.out, K = 10)$delta[1]

My issue lies in the for loop. I've tried to construct a loop which will append the values from "vars" into "model.use", but I cannot seem to get it to read past the 2nd row in the matrix. Any suggestions? Thanks


Solution

  • It seems like several things are going on here.

    You've made LogACG the explanatory variable in your 2nd-to-last line ("LogACG~, but it's also one of the vars which ultimately end up in model.vars because of vars <- colnames(salary[c(2,3,7)]), so that's not right.

    Next your 2nd for loop should be iterating through the rows of the model.spec.matrix, i.e.

    for(i in 1:nrow(model.spec.matrix)){

    and to programmatically capture the column names (variables) indicated by that row you could do

    cn <- colnames(model.spec.matrix[sapply(model.spec.matrix[i,], function(x) x > 0)])

    within the loop. You should also move the glm and cv.glm inside of the loop.

    However, that will overwrite the glm.out and cv.err each time, so you'll want to create them as empty lists and append the list in each iteration.

    So the final product will look like this:

    # Since you can't use LogACG to explain itself, 
    #    suppose you meant to use Engineering as a candidate X
    vars <- colnames(salary[c(2,3,8)])
    
    # Make your grid
    model.spec.matrix        <- expand.grid(list.to.expand)
    names(model.spec.matrix) <- vars
    
    glm.out <- list(rep(NA, nrow(model.spec.matrix)))
    cv.err  <- list(rep(NA, nrow(model.spec.matrix)))
    
    for(i in 1:nrow(model.spec.matrix)){
      cn <- colnames(model.spec.matrix[sapply(model.spec.matrix[i,], function(x) x > 0)])
      if(length(cn) == 0){cn <- "."}  
      tmp        <- glm(as.formula(paste("LogACG~",paste(cn,collapse = "+"),sep = "")), family = gaussian, data = salary)
      glm.out[i] <- capture.output(tmp$formula)
    }
    # > glm.out
    # [[1]]
    # [1] "LogACG ~ ."
    # 
    # [[2]]
    # [1] "LogACG ~ Rank_Code"
    # 
    # [[3]]
    # [1] "LogACG ~ Gender"
    # 
    # [[4]]
    # [1] "LogACG ~ Rank_Code + Gender"
    # 
    # [[5]]
    # [1] "LogACG ~ Engineering"
    # 
    # [[6]]
    # [1] "LogACG ~ Rank_Code + Engineering"
    # 
    # [[7]]
    # [1] "LogACG ~ Gender + Engineering"
    # 
    # [[8]]
    # [1] "LogACG ~ Rank_Code + Gender + Engineering"
    

    To get the entire model object in the list element, replace

    glm.out[i] <- capture.output(tmp$formula)
    

    with

    glm.out    <- append(glm.out, tmp)
    

    i.e.

    for(i in 1:nrow(model.spec.matrix)){
      cn <- colnames(model.spec.matrix[sapply(model.spec.matrix[i,], function(x) x > 0)])
      if(length(cn) == 0){cn <- "."}  
      tmp        <- glm(as.formula(paste("LogACG~",paste(cn,collapse = "+"),sep = "")), family = gaussian, data = salary)
      tmp1       <- cv.glm(salary, tmp, K = 10)$delta[1]
      glm.out    <- append(glm.out, tmp)
      cv.err     <- append(cv.err, tmp1)
    }
    
    > tail(cv.err)
    [[1]]
    [1] 2.751025
    
    [[2]]
    [1] 2.758954
    
    [[3]]
    [1] 2.735063
    
    [[4]]
    [1] 2.768075
    
    [[5]]
    [1] 2.774433
    
    [[6]]
    [1] 2.748291
    

    Also, you may be better served to make use of a package like bestglm or just the step function from the default pacakge stats (see ?step).