I am using the following function to do cross-validation with the random forest algorithm on my dataset. However, ntree raises an error saying that it is not used in the function. Even though I have seen that usage recommended in a comment on one of the threads about this issue, it did not work for me. Here is my code:
cv_rf_class1 <- train(y_train_u ~ ., x_train_u,
                      method = "cforest",
                      trControl = trainControl(method = "cv",
                                               number = 10,
                                               verboseIter = TRUE,
                                               ntree = 100))
If I cannot change the ntree parameter, the function uses its default of 500 trees and raises another error (subscript out of bounds), so I cannot make it work for my problem. How can I fix this issue so that my function works?
ntree needs to be an argument of train, and not of trainControl as you have used it here; from the documentation of train:
... arguments passed to the classification or regression routine (such as randomForest). Errors will occur if values for tuning parameters are passed here.
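For contrast, here is a minimal sketch (assuming your x_train_u / y_train_u data) of a method whose underlying routine does take ntree, namely method = "rf" (the randomForest package); there, ntree is simply forwarded through the ... argument of train:

library(caret)
cv_rf <- train(x_train_u, y_train_u,
               method = "rf",
               ntree = 100,   # forwarded to randomForest() via ...
               trControl = trainControl(method = "cv",
                                        number = 10,
                                        verboseIter = TRUE))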
Notice also that you are not passing the data in the correct form; train expects the data as (x, y), and not as you are passing them (an incorrect combination of formula and matrices).
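A minimal sketch of the two valid alternatives, assuming y_train_u is a factor and x_train_u is a data frame of predictors:

# (x, y) interface: predictors first, outcome second
fit_xy <- train(x_train_u, y_train_u, method = "cforest")

# formula interface: needs a single data frame that contains the outcome
train_df <- data.frame(x_train_u, y = y_train_u)
fit_formula <- train(y ~ ., data = train_df, method = "cforest")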
All in all, change your train call to:
cv_rf_class1 <- train(x_train_u, y_train_u,
                      method = "cforest",
                      ntree = 100,
                      trControl = trainControl(method = "cv",
                                               number = 10,
                                               verboseIter = TRUE))
UPDATE (after comments)
Well, it seems that cforest in particular will not accept an ntree argument because, in contrast with the original randomForest package, this is not how the number of trees is passed to the underlying cforest function of the party package (docs).
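For reference, a minimal sketch of how the number of trees is set when calling party::cforest directly (assuming a data frame train_df with a factor outcome y):

library(party)
fit_party <- cforest(y ~ ., data = train_df,
                     controls = cforest_unbiased(ntree = 100))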
The correct way, as demonstrated in the relevant examples in the caret Github repo, is:
cv_rf_class1 <- train(x_train_u, y_train_u,
                      method = "cforest",
                      trControl = trainControl(method = "cv",
                                               number = 10,
                                               verboseIter = TRUE),
                      controls = party::cforest_unbiased(ntree = 100))
Adapting cforest.R, we get:
library(caret)
library(plyr)
library(recipes)
library(dplyr)

model <- "cforest"

# simulate small two-class training and test sets
set.seed(2)
training <- twoClassSim(50, linearVars = 2)
testing <- twoClassSim(500, linearVars = 2)
trainX <- training[, -ncol(training)]
trainY <- training$Class

rec_cls <- recipe(Class ~ ., data = training) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())

# fix the resampling seeds so the CV folds are reproducible
seeds <- vector(mode = "list", length = nrow(training) + 1)
seeds <- lapply(seeds, function(x) 1:20)
cctrl1 <- trainControl(method = "cv", number = 3, returnResamp = "all",
                       classProbs = TRUE,
                       summaryFunction = twoClassSummary,
                       seeds = seeds)
set.seed(849)
test_class_cv_model <- train(trainX, trainY,
                             method = "cforest",
                             trControl = cctrl1,
                             metric = "ROC",
                             preProc = c("center", "scale"),
                             controls = party::cforest_unbiased(ntree = 20)) # WORKS OK

test_class_pred <- predict(test_class_cv_model, testing[, -ncol(testing)])
test_class_prob <- predict(test_class_cv_model, testing[, -ncol(testing)], type = "prob")
head(test_class_pred)
# [1] Class2 Class2 Class2 Class1 Class1 Class1
# Levels: Class1 Class2
head(test_class_prob)
# Class1 Class2
# 1 0.4996686 0.5003314
# 2 0.4333222 0.5666778
# 3 0.3625118 0.6374882
# 4 0.5373396 0.4626604
# 5 0.6174159 0.3825841
# 6 0.5327283 0.4672717
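As a quick sanity check on the hold-out set, you can tabulate the class predictions against the true labels with caret's confusionMatrix:

confusionMatrix(test_class_pred, testing$Class)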
Output of sessionInfo():
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] recipes_0.1.7 dplyr_0.8.3 plyr_1.8.4 caret_6.0-84 ggplot2_3.2.1 lattice_0.20-38