I have Python code that fits a decision tree and a random forest. For the decision tree, I find the biggest contributor using:
contr = decisiontree.feature_importances_.max() * 100
contr_full = decisiontree.feature_importances_ * 100
#Showing name
location = pd.to_numeric(np.where(contr_full == contr)[0][0])
result = list(df_dmy)[location + 1]
This returns the biggest contributor in my dataset. The tree is then exported to Graphviz format using:
# label='none' replaces label=None, which newer scikit-learn versions reject
tree.export_graphviz(rpart, out_file=path_file + '\\Decision Tree Code for Graphviz.dot', filled=True,
                     feature_names=list(df_dmy.drop(['Reason of Removal'], axis=1).columns),
                     impurity=False, label='none', proportion=True,
                     class_names=['Unscheduled', 'Scheduled'], rounded=True)
In the case of random forests, I have managed to export every tree that is used there (100 trees):
for i, tree_data in enumerate(rf.estimators_):
    with open('tree_' + str(i) + '.dot', 'w') as my_file:
        # Write this tree straight into the open file handle
        tree.export_graphviz(tree_data, out_file=my_file)
This, of course, generates 100 .dot files, one per tree. However, not every tree shows the result I need, because some trees have a different biggest contributor. I know the biggest contributor of the classifier as a whole, but I also want to see a decision tree that gives that same result.
What I tried was:
i = 0
for tree_data in rf.estimators_:
    # Feature importance
    df_trees = tree_data.tree_.threshold
    contr = df_trees.max() * 100
    contr_full = df_trees * 100
    # Showing name
    location = pd.to_numeric(np.where(contr_full == contr)[0][0])
    result = print(list(df_dmy)[location + 1])
Using this, I get the error "IndexError: list index out of range", and I have no idea what is wrong here.
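A likely cause of the IndexError, sketched on synthetic data: tree_.threshold holds one value per node of the tree, not one per feature, so a position found in it can be larger than the number of columns. feature_importances_ has exactly one entry per feature. The dataset from make_classification and the variable names below are illustrative assumptions, not the original data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real data (assumption): 200 rows, 5 features
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

first_tree = rf.estimators_[0]
# One threshold per node (leaves included), so usually longer than 5
n_thresholds = len(first_tree.tree_.threshold)
# One importance per feature, so always length 5 here
n_importances = len(first_tree.feature_importances_)
print(n_thresholds, first_tree.tree_.node_count, n_importances)
```

Indexing the column list with a position taken from the threshold array can therefore run past the end of the list, while a position taken from feature_importances_ cannot.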
What I want is a dataframe that lists each tree's biggest contributor together with its contribution, so that I can filter it down to the overall biggest contributor and its highest contribution. See example:
Result (in a dataframe) =
   Result  Contribution
0  Car     0.74
1  Bike    0.71
2  Car     0.79
Python already knows that the random forest gave 'car' as the biggest contributor, so the first filter removes everything except 'car':
   Result  Contribution
0  Car     0.74
2  Car     0.79
Then it has to search for the highest contribution and retrieve the index.
   Result  Contribution
2  Car     0.79
Then it has to export the tree information corresponding to that index.
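The filtering steps above can be sketched with pandas roughly as follows. The dataframe is typed in by hand from the example table, and the name overall_top stands in for the forest-level biggest contributor; both are assumptions for illustration.

```python
import pandas as pd

# Example table from above; the row index is the tree's position in the forest
results = pd.DataFrame({'Result': ['Car', 'Bike', 'Car'],
                        'Contribution': [0.74, 0.71, 0.79]})

overall_top = 'Car'  # assumed to come from the forest's own feature importances

# Step 1: keep only the trees whose biggest contributor matches the forest's
matching = results[results['Result'] == overall_top]
# Step 2: index label of the highest contribution among those trees
best_idx = matching['Contribution'].idxmax()
print(best_idx)  # 2
```

Because boolean filtering keeps the original index labels, best_idx still points at the right position in rf.estimators_.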
I know it is quite a long story, but I hope someone knows how to finish this code.
Regards, Ganesh
import numpy as np
import pandas as pd
from sklearn import tree

# rf, titanic_dmy, path_file and graphviz_leafs are assumed to be defined earlier
names = []
contributors = []
for tree_data in rf.estimators_:
    # Feature importances of this tree, one entry per feature column
    contr_full = tree_data.feature_importances_ * 100
    contr_location = int(np.argmax(contr_full))
    # The + 1 offset assumes the target column is the first column of titanic_dmy
    names.append(list(titanic_dmy.columns)[contr_location + 1])
    contributors.append(contr_full[contr_location])
df = pd.DataFrame({'Parameter': names, 'Value': contributors})
# Index of the tree with the highest single contribution
idx = df['Value'].idxmax()
# Export to Graphviz (label='none' replaces label=None, which newer scikit-learn versions reject)
tree.export_graphviz(rf.estimators_[idx], out_file=path_file + '\\RF Decision Tree for Graphviz.dot',
                     filled=True, max_depth=graphviz_leafs,
                     feature_names=list(titanic_dmy.drop(['survived'], axis=1).columns),
                     impurity=False, label='none', proportion=True,
                     class_names=['Unscheduled', 'Scheduled'], rounded=True, precision=2)
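For anyone who wants to try the idea without the original data, here is a self-contained sketch that also restricts the choice to trees agreeing with the forest's own biggest contributor. The synthetic data, the feature names and the variable names are all made up for illustration, and it assumes at least one tree agrees with the forest (otherwise idxmax fails on an empty selection).

```python
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with made-up feature names (assumption, not the original dataset)
feature_names = ['car', 'bike', 'bus', 'train']
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Biggest contributor and its importance for every tree in the forest
top_pos = [int(np.argmax(est.feature_importances_)) for est in rf.estimators_]
df = pd.DataFrame({
    'Parameter': [feature_names[p] for p in top_pos],
    'Value': [est.feature_importances_[p]
              for est, p in zip(rf.estimators_, top_pos)],
})

# Biggest contributor of the forest as a whole
forest_top = feature_names[int(np.argmax(rf.feature_importances_))]

# Keep the trees that agree with the forest, then take the strongest one
agreeing = df[df['Parameter'] == forest_top]
idx = agreeing['Value'].idxmax()

# out_file=None makes export_graphviz return the DOT source as a string
dot_source = tree.export_graphviz(rf.estimators_[idx], out_file=None,
                                  feature_names=feature_names, filled=True)
```

Passing a file path as out_file instead writes the .dot file to disk, as in the export call above.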