Tags: python, machine-learning, scikit-learn, random-forest, decision-tree

How to change generic feature numbers in a decision tree to their real names?


How can I change the generic feature numbers in the output below to the real feature names? I want the feature names listed in the array. My algorithm is this:

Input:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, max_depth=3)
# (rf is assumed to have been fitted before its first estimator is inspected below)

n_nodes = rf.estimators_[0].tree_.node_count
children_left = rf.estimators_[0].tree_.children_left
children_right = rf.estimators_[0].tree_.children_right
feature = rf.estimators_[0].tree_.feature
threshold = rf.estimators_[0].tree_.threshold

node_depth = np.zeros(shape=n_nodes, dtype=np.int64)

is_leaves = np.zeros(shape=n_nodes, dtype=bool)
stack = [(0, -1)]  # seed is the root node id and its parent depth
while len(stack) > 0:
    node_id, parent_depth = stack.pop()
    node_depth[node_id] = parent_depth + 1

    # If we have a test node
    if (children_left[node_id] != children_right[node_id]):
        stack.append((children_left[node_id], parent_depth + 1))
        stack.append((children_right[node_id], parent_depth + 1))
    else:
        is_leaves[node_id] = True

Out:

For feature:

array([41,  0,  0, -2, -2, 55, -2, -2, 40, 45, -2, -2, 44, -2, -2], dtype=int64)

Solution

  • You might use the feature_names_in_ attribute of your fitted random forest estimator to access the feature names

    feature_names_in_: ndarray of shape (n_features_in_,)

    Names of features seen during fit. Defined only when X has feature names that are all strings.

    and index it with your feature array, namely rf.feature_names_in_[feature].

    Of course, you should keep in mind that the -2 values mark nodes where a leaf is reached; indexing rf.feature_names_in_ with them won't account for that, since Python's negative indexing would silently return the second-to-last feature name instead. You can overcome this by first recording the indices where feature equals that sentinel value

    leaves = np.where(feature == -2)[0]
    

    and then using them to overwrite the corresponding entries of the resulting array.

    attr = rf.feature_names_in_[feature]
    attr[leaves] = 'leaf'
    

    Here's a complete example:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn import tree
    
    iris = load_iris(as_frame=True)
    X, y = iris.data, iris.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    rf = RandomForestClassifier(n_estimators=100, max_depth=3)
    rf.fit(X_train, y_train)
    
    n_nodes = rf.estimators_[0].tree_.node_count
    children_left = rf.estimators_[0].tree_.children_left
    children_right = rf.estimators_[0].tree_.children_right
    feature = rf.estimators_[0].tree_.feature
    threshold = rf.estimators_[0].tree_.threshold
    
    node_depth = np.zeros(shape=n_nodes, dtype=np.int64)
    
    is_leaves = np.zeros(shape=n_nodes, dtype=bool)
    stack = [(0, -1)]  # seed is the root node id and its parent depth
    while len(stack) > 0:
        node_id, parent_depth = stack.pop()
        node_depth[node_id] = parent_depth + 1
    
        # If we have a test node
        if (children_left[node_id] != children_right[node_id]):
            stack.append((children_left[node_id], parent_depth + 1))
            stack.append((children_right[node_id], parent_depth + 1))
        else:
            is_leaves[node_id] = True
    
    leaves = np.where(feature == -2)[0]
    attr = rf.feature_names_in_[feature]
    attr[leaves] = 'leaf'
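
    attr now holds a feature name for every split node and the string 'leaf' for every leaf node. As a quick, optional check (a minimal sketch, not part of the original snippet) you could print it, or let scikit-learn do the substitution for you by exporting the first tree as text via its feature_names argument:

    # Split nodes show their real feature name, leaves show the 'leaf' placeholder
    print(attr)

    # Optional: export the first tree as text, letting scikit-learn map node
    # numbers to the real column names (feature_names expects a list of strings)
    print(tree.export_text(rf.estimators_[0], feature_names=list(rf.feature_names_in_)))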