rmachine-learningrandomrandom-forestrandom-seed

Why does the importance parameter influence performance of Random Forest in R?


When using random forests in R I came across the following situation:

library(randomForest)
set.seed(42)
data(iris)
rf_noImportance <- randomForest(Species~.,data=iris,ntree=100,importance=F)
print(table(predict(rf_noImportance),iris$Species))

Output:

            setosa versicolor virginica
 setosa         50          0         0
 versicolor      0         47         3
 virginica       0          3        47

and

library(randomForest)
set.seed(42)
data(iris)  
rf_importance <- randomForest(Species~.,data=iris,ntree=100,importance=T)
print(table(predict(rf_importance),iris$Species))

Output:

            setosa versicolor virginica
 setosa         50          0         0
 versicolor      0         47         4
 virginica       0          3        46

In the first example I set importance = FALSE and in the second example TRUE. From my understanding this should not affect the resulting prediction. There's also no indication for that behavior in the documentation.

According to the Cross Validated thread Do proximity or importance influence predictions by a random forest?, the importance flag should not influence the predictions, but it clearly does in the example above.

So why is the importance parameter of the randomForest method influencing the performance of the model?


Solution

  • This is a nice example for demonstrating the limits of reproducibility; in short:

    Fixing the random seed does not make results comparable if the experiments invoke the random number generator in different ways.

    Let's see why this is so here...

    All reproducibility studies are based on a strong implicit assumption: all other being equal; if a change between two experiments invalidates this assumption, we cannot expect reproducibility in the deterministic sense you are seeking here (we may of course still expect reproducibility in the statistical sense, but this is not the issue here).

    The difference between the two cases you present here (calculating feature importance or not) is really subtle; to see why it actually violates the principle stated above, we have to dig a little, both in the documentation as well as in the source code.

    The documentation of the RF importance function already provides a strong hint (emphasis mine):

    Here are the definitions of the variable importance measures. The first measure is computed from permuting OOB data [...] Then the same is done after permuting each predictor variable.

    You may have already started becoming suspicious; such data permutations are normally performed in a random sense, hence potentially invoking the random number generator (RNG) for one extra process when we use importance=TRUE, which process is absent in the importance=FALSE case.

    In other words: if in the importance=TRUE case the RNG is involved in a way that is absent from the importance=FALSE case, then from the first time such a thing happens in the program the two cases stop being deterministically comparable, regardless of the common random seed.

    At this point, this may be a strong hint, but it is still only a speculation; after all, and in principle, permutations can be performed deterministically, i.e. without involving any random processes (and hence the RNG). Where is the smoking gun?

    Turns out that the smoking gun exists indeed, buried in the underlying C source code used by the randomForest R package; here is the relevant part of the C function permuteOOB:

    /* Permute tp */
        last = nOOB;
        for (i = 0; i < nOOB; ++i) {
            k = (int) last * unif_rand();
            tmp = tp[last - 1];
            tp[last - 1] = tp[k];
            tp[k] = tmp;
            last--;
        }
    

    where we can clearly see the function unif_rand() (i.e. the RNG) being invoked at line #4 of the snippet (here in the source code), in a method called only when we ask for importance=TRUE, and not in the opposite case.

    Arguably, and given the fact that RF is an algorithm where randomness (and hence RNG use) enters from too many points, this should be enough evidence as to why the two cases you present are indeed not expected to be identical, since the different use of the RNG makes the actual results diverge. On the other hand, a difference of one single misclassification among 150 samples should provide enough reassurance that the two cases are still statistically similar. It should be also apparent that the subtle implementation issue here (i.e. the RNG involvement) does not violate the expectation that the two results should be equal in theory.