
How can the scikit-learn Random Forest sub-sample size be equal to the original training data size?


In the documentation of the scikit-learn Random Forest classifier, it is stated that

The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).

What I don't understand is this: if the sample size is always the same as the input sample size, then how can we talk about a random selection? There is no selection here, because we use all the (and naturally the same) samples at each training.

Am I missing something here?


Solution

  • I believe this part of docs answers your question

    In random forests (see RandomForestClassifier and RandomForestRegressor classes), each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.

    The key to understanding is "sample drawn with replacement". This means that each instance can be drawn more than once, which in turn means that some instances in the training set are present several times in a given tree's sample, while others are not present at all (these are the out-of-bag instances). Which instances fall in-bag or out-of-bag differs from tree to tree, and that is where the randomness comes from.
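You can see this effect with a minimal NumPy sketch (the numbers here are illustrative, not taken from scikit-learn's internals): drawing `n` indices with replacement from a training set of size `n` yields a sample of the same size, yet only about 63.2% of the distinct instances appear in it, since the expected fraction of unique draws is `1 - (1 - 1/n)**n`, which approaches `1 - 1/e`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # hypothetical training-set size

# One bootstrap sample (as used per tree): n draws *with replacement*
# from the training-set indices 0..n-1.
bootstrap = rng.choice(n, size=n, replace=True)

in_bag = np.unique(bootstrap)                     # instances this tree trains on
out_of_bag = np.setdiff1d(np.arange(n), in_bag)   # instances never drawn for this tree

# The sample size equals n, but many draws are repeats, so roughly
# 63.2% of the original instances are in-bag and ~36.8% are out-of-bag.
print(len(bootstrap))          # n
print(len(in_bag) / n)         # close to 1 - 1/e ≈ 0.632
print(len(out_of_bag) / n)     # close to 1/e ≈ 0.368
```

Note also that recent scikit-learn versions expose a `max_samples` parameter on `RandomForestClassifier`/`RandomForestRegressor`, which lets you draw bootstrap samples smaller than the full training set if you want an explicit sub-sample size.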