Tags: machine-learning, scikit-learn, random-forest, data-science, ensemble-learning

How are feature importances related to the forest structure in scikit-learn's RandomForestClassifier?


Here is a simple example of my problem, using the Iris dataset. I am puzzled trying to understand how the feature importances are computed and how they show up when visualizing the forest of estimators with export_graphviz. Here is my code:

import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

data = load_iris()
X = pd.DataFrame(data=data.data,columns=['sepallength', 'sepalwidth', 'petallength','petalwidth'])
y = pd.DataFrame(data=data.target)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=2,max_depth=1)
rf.fit(X_train,y_train.iloc[:,0])

The classifier performs poorly (the score is 0.68), which is expected since the forest contains only 2 trees of depth 1. Anyway, this doesn't matter here.
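
The score of 0.68 is presumably the accuracy on the held-out test set, obtained with something like:

print(rf.score(X_test, y_test.iloc[:, 0]))  # mean accuracy on the test split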

The feature importances are retrieved as follows:

importances = rf.feature_importances_
std = np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0)  # std of the per-tree importances
indices = np.argsort(importances)[::-1]

print("Feature ranking:")
for f in range(X.shape[1]):
    print("%d. feature %s (%f)" % (f + 1, X.columns.tolist()[f], importances[indices[f]]))

and the output is:

Feature ranking:
1. feature sepallength (1.000000)
2. feature sepalwidth (0.000000)
3. feature petallength (0.000000)
4. feature petalwidth (0.000000)

Now, showing the structure of the trees that were built with the following code:

from sklearn.tree import export_graphviz
export_graphviz(rf.estimators_[0],
                out_file='tree.dot',  # recent scikit-learn versions require the file name explicitly
                feature_names=X.columns,
                filled=True,
                rounded=True)
!dot -Tpng tree.dot -o tree0.png
from IPython.display import Image
Image('tree0.png')
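
(As an aside, newer scikit-learn versions, 0.21+, can draw the same trees without graphviz via sklearn.tree.plot_tree; a minimal sketch of that alternative, not the export I actually used:)

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, est in zip(axes, rf.estimators_):
    # draw each of the two fitted estimators side by side
    plot_tree(est, feature_names=X.columns.tolist(), filled=True, rounded=True, ax=ax)
plt.show()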

Running the graphviz export above for each of the two estimators, I obtain these two figures:

  • export of tree #0 (figure not reproduced here)

  • export of tree #1 (figure not reproduced here)

I cannot understand how sepallength can have an importance of 1.0 yet not be used for node splitting in either tree (only petallength is used), as shown in the figures.
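
The split features can also be checked directly on the fitted estimators instead of reading them off the exported figures (tree_.feature holds the feature index used at each node, with -2 marking leaves):

for i, est in enumerate(rf.estimators_):
    # per-node split feature indices and this tree's own importances
    print(i, est.tree_.feature, est.feature_importances_)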


Solution

  • You have a bug in

    for f in range(X.shape[1]):
        print("%d. feature %s (%f)" % (f + 1, X.columns.tolist()[f], importances[indices[f]]))
    

    If you permute with indices = np.argsort(importances)[::-1], then you need to permute everything consistently - you cannot keep the labels in the original ordering while taking the importances in a different ordering.

    If you replace the above by

    for f in range(X.shape[1]):
        print("%d. feature %s (%f)" % (f + 1, X.columns.tolist()[f], importances[f]))
    

    then the forest and its trees all agree that the feature at index 2 (petallength) is the only one with any importance.
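
    If you instead want the features printed in decreasing order of importance, permute the labels and the importances with the same index array. A minimal sketch of that consistent version, reusing the variables from the question:

    for rank, idx in enumerate(indices, start=1):
        # idx indexes both the column label and its importance
        print("%d. feature %s (%f)" % (rank, X.columns[idx], importances[idx]))

    For the run shown in the question, this lists petallength first with importance 1.0 and the other three features with 0.0.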