Python - export the final random forests tree for Graphviz


I have Python code that fits a decision tree and a random forest. For the decision tree, I find the biggest contributor (the feature with the highest importance) using:

contr = decisiontree.feature_importances_.max()  * 100
contr_full = decisiontree.feature_importances_  * 100

#Showing name
location = pd.to_numeric(np.where(contr_full == contr)[0][0])
result = list(df_dmy)[location + 1]

This returns the biggest contributor in my dataset. The tree is then exported to Graphviz format using:

tree.export_graphviz(rpart, out_file=path_file + '\\Decision Tree Code for Graphviz.dot', filled=True, 
                 feature_names=list(df_dmy.drop(['Reason of Removal'], axis=1).columns), 
                         impurity=False, label=None, proportion=True, 
                         class_names=['Unscheduled', 'Scheduled'], rounded=True)
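The importance lookup at the top of the question can be sketched more directly with np.argmax. This is a minimal illustration, not the asker's code: the feature names and importance values below are made up, and it assumes the name list holds exactly the columns the model was fitted on, in the same order, so that no `+ 1` offset is needed.

```python
import numpy as np

# Hypothetical importances and feature names, for illustration only.
# In real code these would be model.feature_importances_ and the
# columns of the training frame (without the target column).
feature_names = ["age", "hours", "cycles"]
importances = np.array([0.15, 0.60, 0.25])

# Index of the largest importance; ties resolve to the first occurrence.
top = int(np.argmax(importances))

result = feature_names[top]      # name of the biggest contributor
contr = importances[top] * 100   # its importance as a percentage
print(result, contr)
```

Indexing the same list that is passed as feature_names to export_graphviz keeps the reported name and the plotted tree consistent.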

In the case of random forests, I have managed to export every tree that is used there (100 trees):

i = 0
for tree_data in rf.estimators_:
    with open('tree_' + str(i) + '.dot', 'w') as my_file:
        my_file = tree.export_graphviz(tree_data, out_file=my_file)
    i = i + 1

This, of course, generates 100 .dot files, one per tree. However, not every tree contains the information I need, because individual trees can have a different biggest contributor than the forest as a whole. I already know the biggest contributor of the classifier, but I also want to see a decision tree that has that same result.
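For reference, the per-tree export can also be sketched with enumerate and the string-returning form of export_graphviz. This is a hedged sketch on synthetic data: the dataset, the forest size of 10, and the temporary output directory are assumptions made so the example runs on its own.

```python
import os
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

# Synthetic stand-in data; the real code would use the asker's dataframe.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# With out_file=None, export_graphviz returns the DOT source as a string.
dot_sources = [export_graphviz(est, out_file=None) for est in rf.estimators_]

# Write one .dot text file per tree into a temporary directory
# (hypothetical location, chosen only to keep the sketch self-contained).
out_dir = tempfile.mkdtemp()
for i, dot in enumerate(dot_sources):
    with open(os.path.join(out_dir, f"tree_{i}.dot"), "w") as fh:
        fh.write(dot)
```

Keeping the DOT text in a list makes it possible to decide later which single tree to write out, instead of producing a file for every estimator.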

What I tried was:

i = 0
for tree_data in rf.estimators_:
    #Feature importance
    df_trees = tree_data.tree_.threshold

    contr = df_trees.max() * 100
    contr_full = df_trees * 100

    #Showing name
    location = pd.to_numeric(np.where(contr_full == contr)[0][0])
    result = print(list(df_dmy)[location + 1])

Using this, I get the error "IndexError: list index out of range", and I have no idea what is wrong.

What I want is a dataframe that lists each tree's biggest contributor together with its contribution, so that I can filter it down to the forest's actual biggest contributor and then to the highest contribution. See example:

Result (in a dataframe) =

    Result   Contribution
0   Car      0.74
1   Bike     0.71
2   Car      0.79

The random forest has already identified 'Car' as the biggest contributor, so the first filter removes every row except 'Car':

    Result   Contribution
0   Car      0.74
2   Car      0.79

Then it has to search for the highest contribution and retrieve the index.

    Result   Contribution
2   Car      0.79

Then it has to export the tree information corresponding to that index.
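The filtering steps described above can be sketched in pandas as follows. The dataframe reuses the example values from the question; the variable forest_top is a hypothetical name standing in for the biggest contributor that the random forest as a whole reported.

```python
import pandas as pd

# One row per tree: its biggest contributor and that contributor's value.
df = pd.DataFrame({
    "Result": ["Car", "Bike", "Car"],
    "Contribution": [0.74, 0.71, 0.79],
})

# Assumed to come from the forest's own feature importances.
forest_top = "Car"

# Step 1: keep only the trees whose biggest contributor matches the forest's.
candidates = df[df["Result"] == forest_top]

# Step 2: idxmax returns the index label of the highest contribution,
# which is also the position of the tree in rf.estimators_ as long as
# the dataframe was built in estimator order with a default index.
best_idx = candidates["Contribution"].idxmax()
print(best_idx)
```

The resulting best_idx can then be used to pick the single estimator to export.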

I know it is quite a long story, but I hope someone knows how to finish this code.

Regards, Ganesh


Solution

import numpy as np
import pandas as pd
from sklearn import tree

# rf, titanic_dmy, path_file and graphviz_leafs are defined earlier in the script.
names = []
contributors = []

for tree_data in rf.estimators_:
    # Use feature_importances_ (one value per feature) rather than
    # tree_.threshold (one value per node), as in the question.
    importances = tree_data.feature_importances_ * 100
    contr_location = int(np.argmax(importances))

    # The + 1 assumes the target column is the first column of titanic_dmy.
    names.append(list(titanic_dmy.columns)[contr_location + 1])
    contributors.append(importances[contr_location])

df = pd.DataFrame({'Parameter': names, 'Value': contributors})

# Index of the tree whose biggest contributor has the highest importance
idx = df['Value'].idxmax()

#Export to Graphviz
tree.export_graphviz(rf.estimators_[idx],
                     out_file=path_file + '\\RF Decision Tree for Graphviz.dot',
                     filled=True, max_depth=graphviz_leafs,
                     feature_names=list(titanic_dmy.drop(['survived'], axis=1).columns),
                     impurity=False, label=None, proportion=True,
                     class_names=['Unscheduled', 'Scheduled'], rounded=True, precision=2)
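A likely cause of the IndexError in the question is that tree_.threshold holds one entry per tree node, while the column list has one entry per feature, so a node index can easily run past the end of that list. The sketch below checks the two array lengths on synthetic data; the dataset and forest size are assumptions chosen only to make the example runnable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 5 features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

est = rf.estimators_[0]

# tree_.threshold is indexed by node, so its length is the node count.
n_thresholds = len(est.tree_.threshold)
n_nodes = est.tree_.node_count

# feature_importances_ is indexed by feature, so its length is n_features.
n_importances = len(est.feature_importances_)
n_features = X.shape[1]

print(n_thresholds, n_nodes)
print(n_importances, n_features)
```

Because a fully grown tree typically has many more nodes than the data has features, an index taken from the threshold array is not a valid position in the list of column names.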