How to create a random forest from scratch in R (without the randomForest package)


This is how I currently use a random forest with the randomForest package:

library(randomForest)
rf1 <- randomForest(CLA ~ ., dat, ntree=100, norm.votes=FALSE)
p1 <- predict(rf1, testing, type='response')
confMat_rf1 <- table(p1,testing_CLA$CLA)
accuracy_rf1 <- sum(diag(confMat_rf1))/sum(confMat_rf1)

I don't want to use the randomForest package at all. Given a dataset (dat), how can I get the same results using rpart and the default values of the randomForest package? For instance, for the 100 decision trees, I would need to run the following:

cart.models <- vector("list", 100)
for (i in 1:100) {
  cart.models[[i]] <- rpart(CLA ~ ., data = random_dataset[[i]], cp = -1)
}

Here each random_dataset[[i]] would hold a randomly chosen subset of attributes and rows, sized according to the randomForest defaults. In addition, is rpart used by randomForest?


Solution

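  • It is possible to approximate a random forest by training multiple rpart trees, each on a bootstrap sample of the training rows and a random subset of the features. The feature subset here is chosen once per tree, whereas randomForest considers a fresh random subset of features at every split, so the results will not match randomForest exactly. The following code snippet trains 10 trees to classify the iris species and returns a list of trees, each paired with its out-of-bag accuracy.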

    library(rpart)       # CART trees
    library(Metrics)     # accuracy()
    library(doParallel)  # parallel backend for foreach
    library(foreach)     # foreach() and %dopar%
    
    
    random_forest <- function(train_data, train_formula, method="class", feature_per=0.7, cp=0.01, min_split=20, min_bucket=round(min_split/3), max_depth=30, ntrees = 10) {
    
      target_variable <- as.character(train_formula)[[2]]
      features <- setdiff(colnames(train_data), target_variable)
      n_features <- length(features)
    
      ncores <- detectCores(logical=FALSE)
      cl <- makeCluster(ncores)
      registerDoParallel(cl)
    
      rf_model <- foreach(
        icount(ntrees),
        .packages = c("rpart", "Metrics")
      ) %dopar% {
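        # Random feature subset for this tree (drawn once per tree, not at every split)
        bagged_features <- sample(features, max(1, floor(n_features * feature_per)), replace = FALSE)
        tree_formula <- reformulate(bagged_features, response = target_variable)
        # Bootstrap sample of rows; rows that were never drawn form the out-of-bag set
        index_bag <- sample(nrow(train_data), replace=TRUE)
        in_train_bag <- train_data[index_bag,]
        out_train_bag <- train_data[-index_bag,]
        trControl <- rpart.control(minsplit = min_split, minbucket = min_bucket, cp = cp, maxdepth = max_depth)
        # Fit the tree on the sampled features only
        tree <- rpart(formula = tree_formula,
                      data = in_train_bag,
                      method = method,
                      control = trControl)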
    
        oob_pred <- predict(tree, newdata = out_train_bag, type = "class")
        oob_acc <- accuracy(actual = out_train_bag[, target_variable], predicted = oob_pred)
    
        list(tree=tree, oob_perf=oob_acc)
      }
    
      stopCluster(cl)
    
      rf_model
    
    }
    
    train_formula <- as.formula("Species ~ .")
    forest <- random_forest(train_data = iris, train_formula = train_formula)
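
  • The list returned by random_forest() holds individual trees, so getting forest-level predictions still requires combining them. One common way to do that for classification is a majority vote. The following is only a minimal sketch under assumptions: predict_forest and toy_forest are hypothetical names, and the stand-in forest is built sequentially so the snippet runs on its own; any list whose elements carry an rpart fit in a tree component should work the same way.

    ```r
    library(rpart)

    # Hypothetical helper (not part of any package): majority vote over a list
    # whose elements each store an rpart classification tree in $tree.
    predict_forest <- function(forest, newdata) {
      # One column of predicted class labels per tree
      votes <- sapply(forest, function(m) {
        as.character(predict(m$tree, newdata = newdata, type = "class"))
      })
      votes <- matrix(votes, nrow = nrow(newdata))
      # Most frequent label per row; ties go to the alphabetically first label
      apply(votes, 1, function(v) names(which.max(table(v))))
    }

    # Stand-in forest so the sketch is self-contained: 5 trees on bootstrap samples
    set.seed(1)
    toy_forest <- lapply(1:5, function(i) {
      idx <- sample(nrow(iris), replace = TRUE)
      list(tree = rpart(Species ~ ., data = iris[idx, ], method = "class"))
    })

    pred <- predict_forest(toy_forest, iris)
    # Share of rows where the voted label matches the true species
    acc <- mean(pred == as.character(iris$Species))
    ```

    Here accuracy is computed on the training data purely for illustration; with a held-out set such as the testing data frame from the question, pass that as newdata instead.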