Tags: machine-learning, random-forest, decision-tree

What is bootstrap dataset in random forest?


Random forests train multiple CARTs on bootstrapped samples of the training data. Some sources say that the bootstrapped sample contains only a subset of the original features (like this), while others say it contains all of the original features, with feature sampling happening at each node from the full feature set of the original training data. Most resources don't address this point at all and are largely copied from one another.

Can you tell which of the following two is part of the random forest algorithm?

Let us say my original set of features is S.

  1. I draw a random subset from S at each node of every tree.
  2. I randomly draw a subset S1 for a given tree, and then subset from S1 at every node of that particular tree.

Which one (1 or 2) is it?


Solution

  • Both the documentation of scikit-learn's RandomForestClassifier (link) and that of RandomForestRegressor (link) refer to Breiman, “Random Forests”, Machine Learning, 45(1), 5-32, 2001.

    Breiman writes

    “… random forest with random features is formed by selecting at random, at each node, a small group of input variables to split on.”

    So it is the first of your choices.
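    For concreteness, scikit-learn exposes this per-node behaviour through the `max_features` parameter, which sets how many candidate features are considered at each split (each node), not per tree. A minimal sketch on toy data (the dataset and parameter values here are illustrative, not from the question):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy dataset with 10 features; values chosen only for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

clf = RandomForestClassifier(
    n_estimators=50,
    max_features="sqrt",  # consider sqrt(n_features) candidate features at EVERY split
    random_state=0,
)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy of the fitted forest
```

    Note that each tree still sees the full feature set; only the candidates evaluated at a given split are subsampled.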

    Have a look at this thread for the background: In Random Forest, why is a random subset of features chosen at the node level rather than at the tree level?
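    To make the difference between the two schemes concrete, here is a toy pure-Python sketch of scheme 1: a fresh candidate subset is drawn at every node, so no single tree is permanently restricted to one fixed subset S1. The function names (`choose_split_features`, `grow_tree_nodes`) are my own illustrative helpers, not from any library, and this is not a real tree builder:

```python
import random

def choose_split_features(all_features, m, rng):
    """Scheme 1: at each node, draw a fresh random subset of m candidate
    features. Re-drawing per node means that, over a whole tree, every
    feature can still appear at some split."""
    return rng.sample(all_features, m)

def grow_tree_nodes(all_features, m, n_nodes, rng):
    # Toy illustration: just collect the candidate sets drawn at each of
    # n_nodes split decisions, instead of actually growing a tree.
    return [choose_split_features(all_features, m, rng) for _ in range(n_nodes)]

rng = random.Random(0)
S = list(range(10))  # original feature set S
node_candidates = grow_tree_nodes(S, m=3, n_nodes=5, rng=rng)
print(node_candidates)

# Different nodes get different candidate subsets; across many nodes their
# union typically recovers (nearly) all of S -- unlike scheme 2, where the
# whole tree would be confined to one fixed subset S1.
used = {f for node in node_candidates for f in node}
print(len(used))
```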