Tags: python, machine-learning, scikit-learn, random-forest, decision-tree

How to change generic feature numbers in a decision tree to their real names?


How can I change the generic feature numbers in the output below to the real feature names? I want the feature names listed in the array. My algorithm is this:

Input:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, max_depth=3)
# (rf is assumed to have been fitted before its first estimator is inspected below)

n_nodes = rf.estimators_[0].tree_.node_count
children_left = rf.estimators_[0].tree_.children_left
children_right = rf.estimators_[0].tree_.children_right
feature = rf.estimators_[0].tree_.feature
threshold = rf.estimators_[0].tree_.threshold

node_depth = np.zeros(shape=n_nodes, dtype=np.int64)

is_leaves = np.zeros(shape=n_nodes, dtype=bool)
stack = [(0, -1)]  # seed is the root node id and its parent depth
while len(stack) > 0:
    node_id, parent_depth = stack.pop()
    node_depth[node_id] = parent_depth + 1

    # If we have a test node
    if (children_left[node_id] != children_right[node_id]):
        stack.append((children_left[node_id], parent_depth + 1))
        stack.append((children_right[node_id], parent_depth + 1))
    else:
        is_leaves[node_id] = True

Out:

For feature:

array([41,  0,  0, -2, -2, 55, -2, -2, 40, 45, -2, -2, 44, -2, -2], dtype=int64)

Solution

  • You might use the feature_names_in_ attribute of your fitted random forest estimator to access the feature names

    feature_names_in_: ndarray of shape (n_features_in_,)

    Names of features seen during fit. Defined only when X has feature names that are all strings.

    and index it with your feature array, namely rf.feature_names_in_[feature].

    Of course, you should keep in mind that the -2 values mark nodes where a leaf is reached; indexing rf.feature_names_in_ with them won't account for that, since Python's negative indexing would silently return the second-to-last feature name instead. You can overcome this by first recording the indices where feature equals that sentinel value

    leaves = np.where(feature == -2)[0]
    

    and then using them to overwrite the corresponding entries of the resulting array.

    attr = rf.feature_names_in_[feature]
    attr[leaves] = 'leaf'
    

    Here's a complete example:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn import tree
    
    iris = load_iris(as_frame=True)
    X, y = iris.data, iris.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    rf = RandomForestClassifier(n_estimators=100, max_depth=3)
    rf.fit(X_train, y_train)
    
    n_nodes = rf.estimators_[0].tree_.node_count
    children_left = rf.estimators_[0].tree_.children_left
    children_right = rf.estimators_[0].tree_.children_right
    feature = rf.estimators_[0].tree_.feature
    threshold = rf.estimators_[0].tree_.threshold
    
    node_depth = np.zeros(shape=n_nodes, dtype=np.int64)
    
    is_leaves = np.zeros(shape=n_nodes, dtype=bool)
    stack = [(0, -1)]  # seed is the root node id and its parent depth
    while len(stack) > 0:
        node_id, parent_depth = stack.pop()
        node_depth[node_id] = parent_depth + 1
    
        # If we have a test node
        if (children_left[node_id] != children_right[node_id]):
            stack.append((children_left[node_id], parent_depth + 1))
            stack.append((children_right[node_id], parent_depth + 1))
        else:
            is_leaves[node_id] = True
    
    leaves = np.where(feature == -2)[0]
    attr = rf.feature_names_in_[feature]
    attr[leaves] = 'leaf'
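
    attr now holds a feature name for every split node and the string 'leaf' for every leaf node. As a quick, optional check (a minimal sketch, not part of the original snippet) you could print it, or let scikit-learn do the substitution for you by exporting the first tree as text via its feature_names argument:

    # Split nodes show their real feature name, leaves show the 'leaf' placeholder
    print(attr)

    # Optional: export the first tree as text, letting scikit-learn map node
    # numbers to the real column names (feature_names expects a list of strings)
    print(tree.export_text(rf.estimators_[0], feature_names=list(rf.feature_names_in_)))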