
Caret: glmnet warning - x should be a matrix with 2 or more columns


When I pass a single numeric variable as the only predictor to glmnet through caret, I get an error saying "x should be a matrix with 2 or more columns". When I pass a single factor variable instead, the train function works as expected, and adding a factor variable to the single numeric variable also works. Why is this? I know that glmnet needs a matrix rather than a data frame, but caret should take care of that conversion, as it clearly does for the factor variable. I need to run my whole analysis consistently within the caret framework, and I need to keep my data as a data frame. Here is a sample; please ignore the warning messages caused by the small number of observations, which are not relevant to this problem.

Any help would be much appreciated as I am going crazy!

df <- structure(list(
  Y = structure(c(1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
                  1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L),
                .Label = c("No", "Yes"), class = "factor"),
  A = c("Yes", "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes",
        "No", "No", "Yes", "Yes", "N", "No", "No", "No", "No", "No"),
  B = c(30, 6, 12, 12, 12, 12, 12, 4, 12, 32,
        12, 12, 4, 24, 8, 12, 15, 6, 12, 12),
  C = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L,
                  2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L),
                .Label = c("A", "B"), class = "factor")),
  .Names = c("Y", "A", "B", "C"), row.names = c(NA, 20L), class = "data.frame")



library(caret)

# set up the grid
  tuneGrid <- expand.grid(.alpha = seq(0, 1, 0.05), .lambda = seq(0, 2, 0.05))
  ## 10-fold CV ##
  fitControl <- trainControl(method = 'cv', number = 10, classProbs = TRUE, summaryFunction = twoClassSummary) 

  #works with a single factor variable  (ignore warnings based on small sample size)
  train(Y ~ A, data=df[c("Y", "A")], method="glmnet", 
    family="binomial", trControl = fitControl, tuneGrid = tuneGrid, metric = "ROC")

  #returns an error message when a single numeric independent variable is passed
  train(Y ~ B, data=df[c("Y", "B")], method="glmnet", 
    family="binomial", trControl = fitControl, tuneGrid = tuneGrid, metric = "ROC")

  #works when a factor variable is added to the numeric variable (ignore warnings based on small sample size)
  train(Y ~ B + C, data=df[c("Y", "B", "C")], method="glmnet", 
    family="binomial", trControl = fitControl, tuneGrid = tuneGrid, metric = "ROC")

Solution

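  • As a workaround, add a constant column so that the predictor matrix has two columns. The constant carries no information, so it should not change the fit for B: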

    df$ones <- rep(1, nrow(df))
    train(Y ~ ones+B, data=df[c("Y", "B", "ones")], method="glmnet", 
        family="binomial", trControl = fitControl, tuneGrid = tuneGrid, metric = "ROC")
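
    As for why the factor case works: the sketch below counts predictor columns after dummy coding, assuming caret's formula interface expands predictors in a model.matrix-like way and drops the intercept (an assumption about caret's internals, not something confirmed here). The data frame toy and the helper n_predictor_cols are hypothetical names used only for illustration. Note that A in the question has three distinct values ("N", "No", "Yes"), so it expands to two dummy columns, whereas B stays a single column.

```r
# Toy data mirroring the column types in the question (values are illustrative).
toy <- data.frame(
  Y = factor(c("No", "Yes", "No", "No", "Yes", "No")),
  A = c("Yes", "No", "N", "No", "Yes", "No"),  # three distinct values, like df$A
  B = c(30, 6, 12, 4, 24, 8),
  C = factor(c("A", "B", "A", "B", "A", "B"))
)

# Hypothetical helper: number of predictor columns once the intercept is dropped.
n_predictor_cols <- function(formula, data) {
  mm <- model.matrix(formula, data = data)
  ncol(mm[, colnames(mm) != "(Intercept)", drop = FALSE])
}

cols_A  <- n_predictor_cols(Y ~ A, toy)      # 3 levels -> 2 dummy columns
cols_B  <- n_predictor_cols(Y ~ B, toy)      # one numeric column only
cols_BC <- n_predictor_cols(Y ~ B + C, toy)  # B plus one dummy column for C

print(c(A = cols_A, B = cols_B, BC = cols_BC))
```

    If this reading is right, a factor with only two levels would produce a single column and would presumably hit the same error as B, so the constant-column workaround would apply there too.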