Tags: r, random-forest, r-caret, r-ranger

Hyperparameters not changing results from random forest regression trees


I am trying to tune the hyperparameters of a random forest regression model, and all of the accuracy measures come out exactly the same regardless of the hyperparameter values. I've reproduced the problem with the diamonds dataset from ggplot2. Here is my code:

library(caret)    # train(), trainControl()
library(ggplot2)  # diamonds dataset; the ranger package must also be installed

train = diamonds[, c(1, 5, 8:10)]   # carat, depth, x, y, z
x = c(1:6)
folds = sample(x, size = nrow(diamonds), replace = TRUE)

rf_grid = expand.grid(.mtry = c(2:4),
                      .splitrule = "variance",
                      .min.node.size = 20)
set.seed(105)
model <- train(train[, c(2:5)],
               train$carat,
               method="ranger",
               importance = "impurity",
               metric = "RMSE",
               tuneGrid = rf_grid,
               trControl = trainControl(method="cv",
                                        index=folds, 
                                        search = "random"),
               num.trees = 10,
               tuneLength = 10)
results1 <- as.data.frame(model$results)
results1$ntree <- 10
results1$sample.size <- nrow(train)
saveRDS(model, "sample_model.rds")
write.csv(results1, "sample_model.csv", row.names = FALSE)

Here's what I get for results:

[Screenshot of model results: RMSE, Rsquared, and MAE are identical for mtry = 2, 3, and 4]

What the heck?

UPDATE: I reduced the sample size to 1,000 rows to allow for faster processing and got different numbers, but they are still identical across every hyperparameter combination. Code:

train = diamonds[, c(1, 5, 8:10)]
train = train[c(1:1000), ]   # keep only the first 1,000 rows
x = c(1:6)
folds = sample(x, size = nrow(train), replace = TRUE)

rf_grid = expand.grid(.mtry = c(2:4),
                      .splitrule = "variance",
                      .min.node.size = 20)
set.seed(105)
model <- train(train[, c(2:5)],
               train$carat,
               method="ranger",
               importance = "impurity",
               metric = "RMSE",
               tuneGrid = rf_grid,
               trControl = trainControl(method="cv",
                                        index=folds, 
                                        search = "random"),
               num.trees = 10,
               tuneLength = 10)
results1 <- as.data.frame(model$results)
results1$ntree <- 10
results1$sample.size <- nrow(train)
saveRDS(model, "sample_model2.rds")
write.csv(results1, "sample_model2.csv", row.names = FALSE)

Results:

[Screenshot: different numbers than before, but again identical across all mtry values]


Solution

  • This seems to be an issue with your cross-validation folds. When I run your code and print model, the summary says:

    Summary of sample sizes: 1, 1, 1, 1, 1, 1, ...
    

    indicating that each resample trains on a single row. The index argument of trainControl() expects a list in which each element holds the row indices used for training in one resample; passing a plain integer vector instead makes caret treat every individual element as its own one-row training set, so the fitted models barely depend on the hyperparameters at all.

    I think if you define folds with caret's createFolds() instead, it will work the way you're expecting (a short demonstration of the difference, plus a full corrected sketch, follows the output below):

    folds <- createFolds(train$carat, k = 6, returnTrain = TRUE)
    

    The results then look like this:

    Random Forest 
    
    1000 samples
       4 predictor
    
    No pre-processing
    Resampling: Cross-Validated (10 fold) 
    Summary of sample sizes: 832, 833, 835, 834, 834, 832, ... 
    Resampling results across tuning parameters:
    
      mtry  RMSE        Rsquared   MAE       
      2     0.01582362  0.9933839  0.00985451
      3     0.01601980  0.9932625  0.00994588
      4     0.01567161  0.9935624  0.01018242
    
    Tuning parameter 'splitrule' was held constant at a value
     of variance
    Tuning parameter 'min.node.size' was held constant
     at a value of 20
    RMSE was used to select the optimal model using the smallest value.
    The final values used for the model were mtry = 4, splitrule
     = variance and min.node.size = 20.
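
    To see concretely why the original folds object misbehaves, here is a small sketch; the values in the comments are what I'd expect rather than verbatim output. (As an aside, the "(10 fold)" label in the printout above appears to come from trainControl's default number argument rather than from the six folds actually supplied through index.)

    # The question's folds: a plain integer vector with one entry per row.
    # Because index expects a list, caret treats each single integer as a
    # complete one-row training set.
    bad_folds <- sample(c(1:6), size = 1000, replace = TRUE)
    str(bad_folds)       # int [1:1000] ...

    # createFolds(returnTrain = TRUE) returns what index actually wants:
    # a list of 6 integer vectors, each holding the training-row indices
    # for one resample (roughly 5/6 of the rows apiece).
    library(caret)
    good_folds <- createFolds(runif(1000), k = 6, returnTrain = TRUE)
    lengths(good_folds)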
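
    For completeness, here is a minimal end-to-end sketch of the corrected workflow, assuming caret, ranger, and ggplot2 are installed (the grid and seed are taken from the question):

    library(caret)    # train(), trainControl(), createFolds()
    library(ggplot2)  # diamonds dataset

    train <- diamonds[c(1:1000), c(1, 5, 8:10)]   # carat, depth, x, y, z

    rf_grid <- expand.grid(.mtry = c(2:4),
                           .splitrule = "variance",
                           .min.node.size = 20)

    set.seed(105)
    folds <- createFolds(train$carat, k = 6, returnTrain = TRUE)
    model <- train(train[, c(2:5)],
                   train$carat,
                   method = "ranger",
                   importance = "impurity",
                   metric = "RMSE",
                   tuneGrid = rf_grid,
                   trControl = trainControl(method = "cv", index = folds),
                   num.trees = 10)
    model$results   # RMSE, Rsquared, and MAE now vary with mtry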