I'm getting an error when I specify index for rfeControl and trainControl.
To build glmnet RFE functions I coded:

glmnetFuncs <- caretFuncs   # default caret functions
glmnetFuncs$summary <- twoClassSummary
To specify the index for rfeControl:

MyRFEcontrol <- rfeControl(
  method = "LGOCV",
  number = 5,
  index = RFE_CV_IN,
  functions = glmnetFuncs,
  verbose = TRUE)
To specify the index for trainControl:

MyTrainControl <- trainControl(
  method = "LGOCV",
  index = indexIN,
  classProbs = TRUE,
  summaryFunction = twoClassSummary)
Since the data set is big, I picked just 3 random columns to make sure it works:

x <- train_v_final4[, c(1, 30, 55)]
y <- TARGET

RFE <- rfe(x = x, y = y, sizes = seq(2, 3, by = 1),
           metric = "ROC", maximize = TRUE, rfeControl = MyRFEcontrol,
           method = "glmnet",
           # tuneGrid = expand.grid(.alpha = c(0, 0.1, 1), .lambda = c(0.1, 0.01, 0.05)),
           trControl = MyTrainControl)
But I'm getting an error saying:
**model fit failed for a: alpha=0.10, lambda=3 Error in if (!all(o)) { : missing value where TRUE/FALSE needed**
I tried all the other combinations I could think of:

- specifying index in rfeControl and trainControl,
- specifying index in rfeControl but not in trainControl,
- specifying index in trainControl but not in rfeControl.

None of them works. It works fine, however, if I use these index lists in the train() function directly. Does anyone know what I need to fix? Any comments/thoughts are much appreciated!
Details
> nearZeroVar(x[indexIN[[1]],])
integer(0) #other results (nearZeroVar(x[indexIN[[2]],])..etc...)are omitted since the outputs are identical.
> cor(x[indexIN[[1]],])
id category_q total_spent_90
id 1.0000000000 0.0300781 0.0001837173
category_q 0.0300781045 1.0000000 0.4102276754
total_spent_90 0.0001837173 0.4102277 1.0000000000
> nearZeroVar(x[RFE_CV_IN[[1]],])
integer(0)
> cor(x[RFE_CV_IN[[1]],])
id category_q total_spent_90
id 1.0000000000 0.002903591 -0.0004827006
category_q 0.0029035912 1.000000000 0.9612495056
total_spent_90 -0.0004827006 0.961249506 1.0000000000
> str(RFE_CV_IN)
List of 20
$ Resample01: int [1:28670] 8 12 35 39 47 51 55 66 71 76 ...
$ Resample02: int [1:28670] 1 5 7 38 39 49 55 76 91 100 ...
$ Resample03: int [1:28670] 1 5 7 8 18 30 38 39 49 63 ...
$ Resample04: int [1:28670] 9 12 18 24 30 35 38 39 49 51 ...
$ Resample05: int [1:28670] 8 30 47 49 51 63 71 76 77 92 ...
$ Resample06: int [1:28670] 1 18 30 39 49 55 63 66 71 77 ...
$ Resample07: int [1:28670] 5 18 24 25 51 76 91 101 112 116 ...
$ Resample08: int [1:28670] 1 5 7 12 24 25 38 39 49 51 ...
$ Resample09: int [1:28670] 8 18 24 25 38 49 51 76 101 113 ...
....omit rest...
> str(indexIN)
List of 20
$ Resample01: int [1:64024] 1 6 11 12 14 15 17 19 20 22 ...
$ Resample02: int [1:64024] 8 11 13 14 18 19 21 22 24 25 ...
$ Resample03: int [1:64024] 1 3 4 6 11 13 14 15 16 21 ...
$ Resample04: int [1:64024] 3 9 11 12 13 14 22 24 26 28 ...
.....omit rest
The problem might be that the outer function (rfe) uses the same row indicators as the original data but, once train sees the data, those row numbers don't mean the same thing.

Suppose you have 100 data points and are doing 10-fold CV, where the first fold is rows 1-10, the second is rows 11-20, etc. On the first fold, rfe passes rows 11-100 to train. If the index vector in train has any indices > 90, there will be an error. If not, it may run, but not with the rows that you originally told train to use.
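The shift is easy to see with a quick sketch in R (the numbers below follow the 100-row / 10-fold example above; they are not from the question's data):

```r
all_rows   <- 1:100                   # row numbers in the original data
outer_hold <- 1:10                    # first outer fold is held out
inner_data <- all_rows[-outer_hold]   # the 90 rows rfe hands to train

# An index vector built against the ORIGINAL data ...
orig_index <- c(5, 50, 95)

# ... means something different inside train, where positions run 1-90:
inner_data[orig_index]
# -> 15 60 NA   (not original rows 5, 50, 95; position 95 is out of bounds)
```

So the same integers silently point at different records, or fall off the end of the subset entirely.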
You could do this, but it would require a separate set of resampling indices for each resample of the outer model (i.e. rfe), since the inner data will be different each time. Also, you would need to be really careful if you do bootstrapping, since it samples with replacement; if you aren't, your model-building data and the holdout data could contain the exact same records.
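One way to sketch building such per-resample indices (a hypothetical construction using caret's createFolds; y_outcome is a stand-in outcome, not a variable from the question):

```r
library(caret)

set.seed(1)
y_outcome <- factor(sample(c("yes", "no"), 200, replace = TRUE))

# Outer resamples: row numbers in the original data
outer_index <- createFolds(y_outcome, k = 10, returnTrain = TRUE)

# Inner resamples: built separately for EACH outer resample, so the
# indices refer to positions within that resample's training subset
inner_index <- lapply(outer_index, function(train_rows) {
  createFolds(y_outcome[train_rows], k = 10, returnTrain = TRUE)
})

# e.g. the indices a train() call inside the first outer resample would need:
str(inner_index[[1]], list.len = 2)
```

Note this only constructs the nested index lists; wiring a different inner list into each of rfe's resamples is not something a single trainControl index argument can express.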
If you really want reproducibility/traceability, set the seeds in rfeControl and trainControl. I'm pretty sure that you will get the same resamples across different runs (as long as the data set and resampling methods stay the same across runs).
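For example (a sketch reusing the objects from the question, assuming the index arguments have been dropped from the control objects so caret generates the resamples itself; rfeControl() and trainControl() also take a seeds argument for finer control):

```r
# Setting the RNG seed just before the call makes caret regenerate the
# same resamples on every run, with no pre-built index lists needed
set.seed(721)
RFE <- rfe(x = x, y = y, sizes = 2:3,
           metric = "ROC", maximize = TRUE,
           rfeControl = MyRFEcontrol,
           method = "glmnet",
           trControl = MyTrainControl)
```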
Max