
Bagging logistic regression in R


R_blogger provides the following code; my additions are commented out because they don't work. I am looking for a way to save the coefficient vectors and p-values from the iterated logistic regressions so that I can prune variables that consistently score poorly.

predictions <- foreach(m=1:iterations,.combine=cbind) %do% {
  training_positions <- sample(nrow(training2), size=floor((nrow(training2)/length_divisor)))
  train_pos<-1:nrow(training2) %in% training_positions
  glm_fit <- glm(default~. ,data=training2[train_pos,],family=binomial(logit), 
                 type=response, control = list(maxit = 25))
  predict(glm_fit,newdata=testing)
  #pvalues <- summary(glm_fit)$coeff[-1,4] < 0.0001
  #coeffs <- summary(glm_fit)$coeff[-1,3] 
  }
probs <- rowMeans(predictions)

I want to be able to retrieve objects for the coefficients and p-values, in the same way as predictions.


Solution

  • NB This response has been reworked based on the exchange in the comments.

    So there are several things going on here.

    1. I assumed that the dataset training which you provided is supposed to be the same as training2 in your code. The first column in this dataset is an id, and your code will include that as a parameter in the fit. Is that what you wanted?
    2. Your code for extracting a sample of rows is unnecessarily complex. You generate a sample of integers between 1 and nrow(training2), and from that generate a logical vector of length nrow(training2). You don't need to do that: just use the vector of integers to index training2 directly. It is much faster, especially with such a large dataset.
    3. When attempting a fit with such a large number of parameters (>1400), glm(...) seems to want an initial estimate of the means. Rather than spending time on that I just restricted the model to the first 9 parameters (columns 2:10).
    4. Passing type=response in the call to glm(...) has no effect: type is not an argument of glm, so this parameter does nothing.
    5. However, in the call to predict(...) you do need to specify type="response"; otherwise predict returns values on the link (log-odds) scale rather than probabilities.
    6. Using maxit = 25 generally meant the fits did not converge, so you need to experiment with that.
    7. In the small set of iterations I tried, none of the coefficients had p<0.0001, so I changed the cutoff to 0.1 for the sake of the example.
    8. And finally, using return(list(...)) as in the code below, plus changing .combine=cbind to .combine=rbind returns an array of list objects, where each row corresponds to an iteration, and column 1 has the vector of predictions for that iteration, column 2 has the vector of p-values for that iteration, and column 3 has the vector of coefficients for that iteration.
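
    Points 2 and 5 can be checked on a small simulated data set. The sketch below does not use the original data: the data frame d, its columns x1, x2 and default, and the sample sizes are all made up for illustration.

```r
# Simulated stand-in for training2; all names and sizes here are assumptions.
set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$default <- rbinom(n, 1, plogis(0.5 * d$x1 - 0.25 * d$x2))

# Point 2: the integer positions returned by sample() can index rows directly.
pos  <- sample(nrow(d), size = 50)
mask <- 1:nrow(d) %in% pos            # the logical detour from the question
same_rows <- setequal(rownames(d[pos, ]), rownames(d[mask, ]))

# Point 5: predict() on a glm returns the linear predictor (log-odds) by
# default; type = "response" returns probabilities.
fit      <- glm(default ~ ., data = d[pos, ], family = binomial(logit))
log_odds <- predict(fit, newdata = d[-pos, ])
probs    <- predict(fit, newdata = d[-pos, ], type = "response")
```

    Here same_rows is TRUE because both forms select the same 50 rows (in a different order), and plogis(log_odds) reproduces probs.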

    Here's the code:

    library(foreach)
    set.seed(1)
    training2      <- training
    length_divisor <- 1000
    iterations     <- 5
    predictions <- foreach(m=1:iterations,.combine=rbind) %do% {
      training_positions <- sample(nrow(training2), 
                                   size=floor((nrow(training2)/length_divisor)))
    #  train_pos<-1:nrow(training2) %in% training_positions
      glm_fit <- glm(default~ . ,
                     data=training2[training_positions,c(2:10,ncol(training2))],
                     family=binomial(logit),
                     control = list(maxit = 25))
      pr <- predict(glm_fit,
                    newdata=training2[sample(1:nrow(training2),10),], 
                    type="response")
      s <- summary(glm_fit)
      p   <- s$coefficients[,4]   # p-values, named by term
      est <- s$coefficients[,1]   # coefficient estimates (avoids masking base::c)
      pvalues <- p[p<0.1]
      coeffs  <- est[p<0.1]
      return(list(pr,pvalues,coeffs))
    }
    predictions
    #          [,1]       [,2]      [,3]     
    # result.1 Numeric,10 Numeric,0 Numeric,0
    # result.2 Numeric,10 Numeric,0 Numeric,0
    # result.3 Numeric,10 Numeric,2 Numeric,2
    # result.4 Numeric,10 Numeric,0 Numeric,0
    # result.5 Numeric,10 Numeric,0 Numeric,0
    

    So in this arrangement, predictions[,1] is a list of all the prediction vectors, predictions[,2] is a list of the p-values below 0.1 for each iteration, and predictions[,3] is a list of the coefficients with p-value below 0.1 for each iteration.
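
    To get from this structure to a pruning decision, one option is to tabulate how often each term falls below the cutoff across iterations. This is a minimal sketch, using base R and simulated data in place of training2 and foreach. The names d, fit_once, res and hits are hypothetical, and do.call(rbind, lapply(...)) is used here only to build the same kind of list matrix that .combine=rbind produces.

```r
# Simulated data: an assumption for illustration, not the original data set.
set.seed(1)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$default <- rbinom(n, 1, plogis(1.5 * d$x1))   # only x1 carries signal

# One bagging iteration: fit on a random subset, keep terms with p < 0.1.
fit_once <- function(m) {
  pos <- sample(nrow(d), size = 200)
  fit <- glm(default ~ ., data = d[pos, ], family = binomial(logit))
  cf  <- summary(fit)$coefficients
  p   <- cf[, 4]                     # named vector of p-values
  est <- cf[, 1]                     # named vector of estimates
  list(p[p < 0.1], est[p < 0.1])
}

# 20 x 2 list matrix: column 1 holds p-values, column 2 holds estimates.
res <- do.call(rbind, lapply(1:20, fit_once))

# Count, per term, the number of iterations in which it passed the cutoff.
hits <- table(unlist(lapply(res[, 1], names)))
```

    Terms that rarely or never show up in hits are candidates for pruning; hits / nrow(res) gives the share of iterations in which each term passed. The intercept is counted like any other term, so drop it from the table if it is not of interest.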