Tags: machine-learning, random-forest, decision-tree

What is bootstrap dataset in random forest?


Random forests train multiple CARTs on bootstrapped samples of the training data. Some sources say that the bootstrapped sample contains only a subset of the original features (like this), while others say it contains all of the original features, with feature sampling happening at each node from the full feature set of the original training data. Most resources don't address this point at all and are largely copied from one another.

Can you tell which of the following two is part of the random forest algorithm?

Let us say my original set of features is S.

  1. I draw a random subset from S at each node of every tree.
  2. I randomly draw a subset S1 for a given tree, and then subset from S1 at every node of that particular tree.

Which one (1 or 2) is it?


Solution

  • Both the documentation of scikit-learn's RandomForestClassifier (link) and that of RandomForestRegressor (link) refer to Breiman, “Random Forests”, Machine Learning, 45(1), 5-32, 2001.

    Breiman writes

    “… random forest with random features is formed by selecting at random, at each node, a small group of input variables to split on.”

    So it is the first of your choices.
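    For concreteness, scikit-learn exposes this per-node behaviour through the `max_features` parameter, which sets how many candidate features are considered at each split (each node), not per tree. A minimal sketch on toy data (the dataset and parameter values here are illustrative, not from the question):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy dataset with 10 features; values chosen only for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

clf = RandomForestClassifier(
    n_estimators=50,
    max_features="sqrt",  # consider sqrt(n_features) candidate features at EVERY split
    random_state=0,
)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy of the fitted forest
```

    Note that each tree still sees the full feature set; only the candidates evaluated at a given split are subsampled.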

    Have a look at this thread for the background: In Random Forest, why is a random subset of features chosen at the node level rather than at the tree level?
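    To make the difference between the two schemes concrete, here is a toy pure-Python sketch of scheme 1: a fresh candidate subset is drawn at every node, so no single tree is permanently restricted to one fixed subset S1. The function names (`choose_split_features`, `grow_tree_nodes`) are my own illustrative helpers, not from any library, and this is not a real tree builder:

```python
import random

def choose_split_features(all_features, m, rng):
    """Scheme 1: at each node, draw a fresh random subset of m candidate
    features. Re-drawing per node means that, over a whole tree, every
    feature can still appear at some split."""
    return rng.sample(all_features, m)

def grow_tree_nodes(all_features, m, n_nodes, rng):
    # Toy illustration: just collect the candidate sets drawn at each of
    # n_nodes split decisions, instead of actually growing a tree.
    return [choose_split_features(all_features, m, rng) for _ in range(n_nodes)]

rng = random.Random(0)
S = list(range(10))  # original feature set S
node_candidates = grow_tree_nodes(S, m=3, n_nodes=5, rng=rng)
print(node_candidates)

# Different nodes get different candidate subsets; across many nodes their
# union typically recovers (nearly) all of S -- unlike scheme 2, where the
# whole tree would be confined to one fixed subset S1.
used = {f for node in node_candidates for f in node}
print(len(used))
```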