rmachine-learningrandomrandom-forestrandom-seed# Why does the importance parameter influence performance of Random Forest in R?

## Fixing the random seed does not make results comparable if the experiments invoke the random number generator in different ways.

When using random forests in `R`

I came across the following situation:

```
library(randomForest)
set.seed(42)
data(iris)
rf_noImportance <- randomForest(Species~.,data=iris,ntree=100,importance=F)
print(table(predict(rf_noImportance),iris$Species))
```

Output:

`setosa versicolor virginica setosa 50 0 0 versicolor 0 47 3 virginica 0 3 47`

and

```
library(randomForest)
set.seed(42)
data(iris)
rf_importance <- randomForest(Species~.,data=iris,ntree=100,importance=T)
print(table(predict(rf_importance),iris$Species))
```

Output:

`setosa versicolor virginica setosa 50 0 0 versicolor 0 47 4 virginica 0 3 46`

In the first example I set `importance = FALSE`

and in the second example `TRUE`

. From my understanding this should not affect the resulting prediction. There's also no indication for that behavior in the documentation.

According to the Cross Validated thread Do proximity or importance influence predictions by a random forest?, the importance flag should not influence the predictions, but it clearly does in the example above.

So why is the importance parameter of the `randomForest`

method influencing the performance of the model?

Solution

This is a nice example for demonstrating the **limits of reproducibility**; in short:

Let's see why this is so here...

All reproducibility studies are based on a **strong** implicit assumption: *all other being equal*; if a change between two experiments invalidates this assumption, we cannot expect reproducibility in the *deterministic* sense you are seeking here (we may of course still expect reproducibility in the *statistical* sense, but this is not the issue here).

The difference between the two cases you present here (calculating feature importance or not) is really subtle; to see why it actually violates the principle stated above, we have to dig a little, both in the documentation as well as in the source code.

The documentation of the RF `importance`

function already provides a strong hint (emphasis mine):

Here are the definitions of the variable importance measures. The first measure is computed from

permutingOOB data [...] Then the same is done afterpermutingeach predictor variable.

You may have already started becoming suspicious; such *data permutations* are normally performed in a random sense, hence potentially invoking the random number generator (RNG) for one extra process when we use `importance=TRUE`

, which process is absent in the `importance=FALSE`

case.

In other words: if in the `importance=TRUE`

case the RNG is involved in a way that is absent from the `importance=FALSE`

case, then from the first time such a thing happens in the program the two cases stop being *deterministically* comparable, regardless of the common random seed.

At this point, this may be a strong hint, but it is still only a speculation; after all, and in principle, permutations can be performed deterministically, i.e. without involving any random processes (and hence the RNG). Where is the smoking gun?

Turns out that the smoking gun exists indeed, buried in the underlying C source code used by the `randomForest`

R package; here is the relevant part of the C function `permuteOOB`

:

```
/* Permute tp */
last = nOOB;
for (i = 0; i < nOOB; ++i) {
k = (int) last * unif_rand();
tmp = tp[last - 1];
tp[last - 1] = tp[k];
tp[k] = tmp;
last--;
}
```

where we can clearly see the function `unif_rand()`

(i.e. the RNG) being invoked at line #4 of the snippet (here in the source code), in a method called only when we ask for `importance=TRUE`

, and not in the opposite case.

Arguably, and given the fact that RF is an algorithm where randomness (and hence RNG use) enters from too many points, this should be enough evidence as to why the two cases you present are indeed not expected to be *identical*, since the different use of the RNG makes the actual results diverge. On the other hand, a difference of one single misclassification among 150 samples should provide enough reassurance that the two cases are still *statistically* similar. It should be also apparent that the subtle implementation issue here (i.e. the RNG involvement) does not violate the expectation that the two results *should* be equal in theory.

- Installing R on Linux: configure: error: libcurl >= 7.28.0 library and headers are required with support for https
- How to do ensembles with time series using AICc?
- planes3d expands and draws the area based on the sphere's radius
- How to extract tag code itself using R, rvest
- How to Display or Print Contents of Environment in R
- How to use Windows user credentials for proxy authentication in R/RStudio
- R reticulate specifying python executable to use
- Replace multiple Instances of a variable name in an R function and save the modified function
- Standardizing address formatting in R
- How to fix "failed to load cairo DLL" in R?
- Using grepl to filter columns names in specific range of columns
- changing the legends in ggplot2 to have groups of similar labels
- How to keep only unique rows but ignore a column?
- convert string date to R Date FAST for all dates
- Add subgroup text to plotly pie chart
- R Shiny : adjust height of DT datatable when fillContainer=TRUE,
- Why do R external pointers' "unusual copying semantics" mean they should not be used stand-alone?
- How to extract somo character after a string with a number of word which can change in R
- What does `se` stand for in geom_smooth(..., se = FALSE)?
- How to find number of rows greater than any values in R
- Align text and reduce space between text and parentheses in plotly hover info box
- Remove outer box of geom_bar plot with broken y-axis
- How to use lag/lead in mutate with an initial value?
- Is it possible to have a Shiny ConditionalPanel whose condition is a global variable?
- counting elements in one list in another list
- How to vectorize nested loops in R?
- Replace NA values with an incrementing sequence starting from the previous non-NA value
- How can I calculate the number of uniques in a row within a species matrix?
- How to perform operations on pairs of rows, based on a "distinguishing" column's values
- Mutate variable based on previous observations