Columns not available when training a lasso model using caret


I am getting an odd error

Error in `[.data.frame`(data, , lvls[1]) : undefined columns selected

message when I use caret to train a glmnet model. I have used basically the same code and the same predictors for an ordinal model (just with a different factor y) and it worked fine (it took 400 core hours to compute, so I can't show it here, though).

library(caret)
library(dplyr)  # for %>% and select()

# Source a small subset of the data
source("https://gist.githubusercontent.com/FredrikKarlssonSpeech/ebd9fccf1de6789a3f529cafc496a90c/raw/efc130e41c7d01d972d1c69e59bf8f5f5fea58fa/voice.R")
trainIndex <- createDataPartition(notna$RC, p = .75, 
                                  list = FALSE, 
                                  times = 1)


training <- notna[ trainIndex[,1],] %>%
  select(RC,FCoM_envel:ATrPS_freq,`Jitter->F0_abs_dif`:RPDE)
testing  <- notna[-trainIndex[,1],] %>%
  select(RC,FCoM_envel:ATrPS_freq,`Jitter->F0_abs_dif`:RPDE)

fitControl <- trainControl(## 10-fold CV
  method = "CV",
  number = 10,
  allowParallel=TRUE,
  savePredictions="final",
  summaryFunction=twoClassSummary)

vtCVFit <- train(x=training[-1],y=training[,"RC"], 
                  method = "glmnet", 
                  trControl = fitControl,
                  preProcess=c("center", "scale"),
                  metric="Kappa"
)

I can't find anything obviously wrong with the data. There are no NAs:

table(is.na(training))

FALSE 
43166

and I don't see why it would try to index outside of the number of columns.

Any suggestions?


Solution

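  • You have to remove summaryFunction=twoClassSummary from your trainControl(). twoClassSummary computes ROC from class probabilities, which caret only produces when classProbs=TRUE is set; without them, the column named after the first class level does not exist, hence the "undefined columns selected" error. It also reports ROC, Sens and Spec rather than Kappa, whereas the default summary function provides Accuracy and Kappa. Without it, the code works for me: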

    fitControl <- trainControl(## 10-fold CV
      method = "CV",
      number = 10,
      allowParallel = TRUE,
      savePredictions = "final")
    
    vtCVFit <- train(x = training[-1], y = training[, "RC"],
                     method = "glmnet",
                     trControl = fitControl,
                     preProcess = c("center", "scale"),
                     metric = "Kappa")
    
    print(vtCVFit)
    
    #glmnet 
    
    #113 samples
    #381 predictors
    #  2 classes: 'NVT', 'VT' 
    
    #Pre-processing: centered (381), scaled (381) 
    #Resampling: Bootstrapped (25 reps) 
    #Summary of sample sizes: 113, 113, 113, 113, 113, 113, ... 
    #Resampling results across tuning parameters:
    
    #  alpha  lambda      Accuracy   Kappa    
    #  0.10   0.01113752  0.5778182  0.1428393
    #  0.10   0.03521993  0.5778182  0.1428393
    #  0.10   0.11137520  0.5778182  0.1428393
    #  0.55   0.01113752  0.5778182  0.1428393
    #  0.55   0.03521993  0.5748248  0.1407333
    #  0.55   0.11137520  0.5749980  0.1136131
    #  1.00   0.01113752  0.5815391  0.1531280
    #  1.00   0.03521993  0.5800217  0.1361240
    #  1.00   0.11137520  0.5939621  0.1158007
    
    #Kappa was used to select the optimal model using the largest value.
    #The final values used for the model were alpha = 1 and lambda = 0.01113752.
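
If you would rather keep twoClassSummary than remove it, one alternative is to ask caret for class probabilities and select the model on ROC instead of Kappa. The following is only a minimal sketch of that idea, not the original poster's setup: it uses caret's simulated data from twoClassSim() as a hypothetical stand-in for the voice data, and the names dat, ctrl and fit are made up for illustration. It assumes the glmnet package is installed and that the class levels are valid R names (as 'NVT' and 'VT' are).

```r
library(caret)  # train() with method = "glmnet" also needs glmnet installed

set.seed(1)

# Hypothetical stand-in data: a simulated two-class data set shipped with
# caret. The outcome is the factor column "Class".
dat <- twoClassSim(200)

ctrl <- trainControl(
  method = "cv",                      # 10-fold cross-validation
  number = 10,
  classProbs = TRUE,                  # twoClassSummary needs class probabilities
  summaryFunction = twoClassSummary,  # reports ROC, Sens and Spec
  savePredictions = "final")

fit <- train(x = dat[, names(dat) != "Class"], y = dat$Class,
             method = "glmnet",
             trControl = ctrl,
             preProcess = c("center", "scale"),
             metric = "ROC")          # must be a metric the summary function returns

print(fit)
```

The design point is that the metric argument has to name something the chosen summary function returns: Accuracy or Kappa with the default summary, ROC, Sens or Spec with twoClassSummary.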