
R caret package: error when specifying index for both rfeControl and trainControl


I'm getting an error when I specify index for both rfeControl and trainControl.

To make the glmnet rfe functions, I coded:

glmnetFuncs <- caretFuncs  # default caret helper functions

glmnetFuncs$summary <- twoClassSummary

To specify the index for rfeControl:

MyRFEcontrol <- rfeControl(
  method="LGOCV",
  number=5,
  index=RFE_CV_IN,
  functions = glmnetFuncs,
  verbose = TRUE)

To specify the index for trainControl:

MyTrainControl <- trainControl(
  method="LGOCV",
  index=indexIN,
  classProbs = TRUE,
  summaryFunction=twoClassSummary
)

Since the data set is large, I chose just 3 random columns to check that it works:

x=train_v_final4[,c(1,30,55)]
y=TARGET


RFE <- rfe(x=x,y=y,sizes = seq(2,3,by=1),
           metric = 'ROC',maximize=TRUE,rfeControl = MyRFEcontrol,
           method='glmnet',
          # tuneGrid = expand.grid(.alpha=c(0,0.1,1),.lambda=c(0.1,0.01,0.05)),
           trControl = MyTrainControl)

But I get an error saying:

**model fit failed for a: alpha=0.10, lambda=3 Error in if (!all(o)) { : missing value where TRUE/FALSE needed**

I tried all the other possible combinations:

  1. specifying index in both rfeControl and trainControl,

  2. specifying index in rfeControl but not in trainControl,

  3. specifying index in trainControl but not in rfeControl.

However, none of them works, yet the same index lists work fine in a plain train() call. Does anyone know what I need to fix? Any comments/thoughts are much appreciated!
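For context, the index argument holds one integer vector of training-row numbers per resample, as the str(indexIN) output below shows. A base-R sketch of how such a list can be built (purely illustrative; the question doesn't show how indexIN was created, and caret's createDataPartition/createResample are the usual stratified helpers):

```r
set.seed(1)
n <- 200       # hypothetical number of rows
times <- 20    # number of resamples
p <- 0.7       # fraction of rows used for training in each resample

# Build a list of training-row indices, one integer vector per resample --
# the same shape as the str(indexIN) output shown in the Details section.
indexIN_demo <- lapply(seq_len(times), function(i)
  sort(sample.int(n, size = floor(p * n))))
names(indexIN_demo) <- sprintf("Resample%02d", seq_len(times))

str(indexIN_demo[1:3])
```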

Details

> nearZeroVar(x[indexIN[[1]],])
integer(0)  # other results (nearZeroVar(x[indexIN[[2]],]), etc.) are omitted since the outputs are identical.

> cor(x[indexIN[[1]],])
                         id category_q total_spent_90
id             1.0000000000  0.0300781   0.0001837173
category_q     0.0300781045  1.0000000   0.4102276754
total_spent_90 0.0001837173  0.4102277   1.0000000000

> nearZeroVar(x[RFE_CV_IN[[1]],])
integer(0)

> cor(x[RFE_CV_IN[[1]],])
                          id  category_q total_spent_90
id              1.0000000000 0.002903591  -0.0004827006
category_q      0.0029035912 1.000000000   0.9612495056
total_spent_90 -0.0004827006 0.961249506   1.0000000000


> str(RFE_CV_IN)
List of 20
 $ Resample01: int [1:28670] 8 12 35 39 47 51 55 66 71 76 ...
 $ Resample02: int [1:28670] 1 5 7 38 39 49 55 76 91 100 ...
 $ Resample03: int [1:28670] 1 5 7 8 18 30 38 39 49 63 ...
 $ Resample04: int [1:28670] 9 12 18 24 30 35 38 39 49 51 ...
 $ Resample05: int [1:28670] 8 30 47 49 51 63 71 76 77 92 ...
 $ Resample06: int [1:28670] 1 18 30 39 49 55 63 66 71 77 ...
 $ Resample07: int [1:28670] 5 18 24 25 51 76 91 101 112 116 ...
 $ Resample08: int [1:28670] 1 5 7 12 24 25 38 39 49 51 ...
 $ Resample09: int [1:28670] 8 18 24 25 38 49 51 76 101 113 ...
 ....omit rest...

> str(indexIN)
List of 20
 $ Resample01: int [1:64024] 1 6 11 12 14 15 17 19 20 22 ...
 $ Resample02: int [1:64024] 8 11 13 14 18 19 21 22 24 25 ...
 $ Resample03: int [1:64024] 1 3 4 6 11 13 14 15 16 21 ...
 $ Resample04: int [1:64024] 3 9 11 12 13 14 22 24 26 28 ...
.....omit rest

Solution

  • The problem might be that the outer function (rfe) uses the same row indices as the original data, but once train sees the (subsetted) data, those row numbers no longer mean the same thing.

    Suppose you have 100 data points and are doing 10-fold CV and the first fold is 1-10, the second is 11-20 etc.

    On the first fold, rfe passes rows 11-100 to train. If the index vector in train has any indices > 90, there will be an error. If not, it may run but not with the rows that you originally told train to use.

    You could do this, but it would require a separate set of resample indices for each resample of the outer model (i.e. rfe), since the inner data will be different each time. You would also need to be really careful with bootstrapping, since it samples with replacement; otherwise your model-building data and your holdout data could contain the exact same records.

    If you really want reproducibility/traceability, set the seed in rfeControl and trainControl. I'm pretty sure that you will get the same resamples across different runs (as long as the data set and resampling methods stay the same across runs).

    Max
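The row-renumbering mismatch described above can be seen without caret at all. A minimal base-R sketch (the sizes and index values are illustrative, not from the question):

```r
# 100 rows; suppose the first outer rfe fold holds out rows 1-10
full_data <- data.frame(x = rnorm(100))
outer_train_rows <- 11:100                 # rfe passes these 90 rows to train

inner_data <- full_data[outer_train_rows, , drop = FALSE]
nrow(inner_data)                           # 90 rows, internally renumbered 1-90

# An index vector built against the ORIGINAL 100-row data no longer lines up:
inner_index <- c(5, 20, 95)                # 95 > 90: out of bounds inside train
any(inner_index > nrow(inner_data))        # TRUE -- the kind of mismatch that errors
```

And even the in-range values (5 and 20) now select different records than they did in the original data, which is the silent version of the same problem.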