I love the decision tree visualisations available from the dtreeviz library (GitHub), and can reproduce them using
# Install libraries
!pip install dtreeviz
!apt-get install graphviz
# Sample code
from sklearn.datasets import *
from sklearn import tree
from dtreeviz.trees import *
from IPython.display import display, HTML
classifier = tree.DecisionTreeClassifier(max_depth=4)
cancer = load_breast_cancer()
classifier.fit(cancer.data, cancer.target)
viz = dtreeviz(classifier,
               cancer.data,
               cancer.target,
               target_name='cancer',
               feature_names=cancer.feature_names,
               class_names=["malignant", "benign"],
               fancy=False)
display(HTML(viz.svg()))
However, when I apply the above to a decision tree I made myself, the code bombs out because my data is in a pandas DataFrame (or a NumPy array), not a scikit-learn Bunch object.
Now, over at Scikit-learn - How to create a Bunch object they tell me pretty sternly not to try to create a Bunch object; but I also do not have the skills to convert my DataFrame or NumPy array into something that the viz function above will accept.
We can suppose my DF has nine features and a target, called 'Feature01', 'Feature02', etc and 'Target01'.
This I would normally split thusly
FeatDF = FullDF.drop( columns = ["Target01"])
LabelDF = FullDF["Target01"]
and then set off on my merry way to assign a classifier or, if for ML, create a train/test split.
None of this is helpful when calling dtreeviz, which expects things like "feature_names" (which I take it is something included in the "bunch" object). And since I can't convert my DF to a Bunch, I'm very much stuck. Oh bring your wisdom, please.
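For what it's worth, the feature names don't actually need a Bunch: a DataFrame carries them in its columns attribute, and they survive the split. A minimal sketch, using a hypothetical frame with the Feature01..Feature09 / Target01 naming scheme described above:

```python
import pandas as pd

# Hypothetical frame matching the naming scheme above: nine features plus a target.
FullDF = pd.DataFrame({f"Feature{i:02d}": range(4) for i in range(1, 10)})
FullDF["Target01"] = ["Red", "Blue", "Teal", "Red"]

FeatDF = FullDF.drop(columns=["Target01"])
LabelDF = FullDF["Target01"]

# The column labels survive the split, so no Bunch object is needed:
feature_names = FeatDF.columns.tolist()
print(feature_names)  # ['Feature01', 'Feature02', ..., 'Feature09']
```

This list is exactly what a feature_names parameter would want, however many features there are.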
Update: I guess any simple DF would illustrate my conundrum. We could just swing with
import pandas as pd
Things = {'Feature01': [3, 4, 5, 0],
          'Feature02': [4, 5, 6, 0],
          'Feature03': [1, 2, 3, 8],
          'Target01': ['Red', 'Blue', 'Teal', 'Red']}
DF = pd.DataFrame(Things,
                  columns=['Feature01', 'Feature02',
                           'Feature03', 'Target01'])
as an example DF. Now, would I then go
DataNP = DF.to_numpy()
classifier.fit(DF.data, DF.target)
feature_names = ['Feature01', 'Feature02', 'Feature03']
#..and what if I have 50 features...
viz = dtreeviz(classifier,
               DF.data,
               DF.target,
               target_name='Target01',
               feature_names=feature_names,
               class_names=["Red", "Blue", "Teal"],
               fancy=False)
or is this daft? Thanks for the guidance so far!
You can use sklearn's LabelEncoder to transform your strings to integers:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(df.Target01)
df['target'] = label_encoder.transform(df.Target01)
dtreeviz expects the class_names to be a list or dict, so let's get it from our label_encoder:
class_names = list(label_encoder.classes_)
Complete code
import pandas as pd
from sklearn import preprocessing, tree
from dtreeviz.trees import dtreeviz
Things = {'Feature01': [3,4,5,0],
'Feature02': [4,5,6,0],
'Feature03': [1,2,3,8],
'Target01': ['Red','Blue','Teal','Red']}
df = pd.DataFrame(Things,
                  columns=['Feature01', 'Feature02',
                           'Feature03', 'Target01'])
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(df.Target01)
df['target'] = label_encoder.transform(df.Target01)
classifier = tree.DecisionTreeClassifier()
classifier.fit(df.iloc[:,:3], df.target)
dtreeviz(classifier,
         df.iloc[:, :3],
         df.target,
         target_name='toy',
         feature_names=df.columns[0:3],
         class_names=list(label_encoder.classes_))
Old answer
Let's use the cancer dataset to create a Pandas dataframe
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target
which gives us the following dataframe.
     mean radius  mean texture  ...  worst fractal dimension  target
0          17.99         10.38  ...                  0.11890       0
1          20.57         17.77  ...                  0.08902       0
2          19.69         21.25  ...                  0.08758       0
..           ...           ...  ...                      ...     ...
568         7.76         24.54  ...                  0.07039       1

[569 rows x 31 columns]
and for your classifier it can be used in the following way.
classifier.fit(df.iloc[:,:-1], df.target)
i.e. just take all but the last column as training/input and the target
column as the output/target.
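Selecting by position with iloc and dropping the target by name are equivalent here; a small sketch (toy columns, not the full cancer data) if you prefer naming the target explicitly:

```python
import pandas as pd

df = pd.DataFrame({"mean radius": [17.99, 20.57],
                   "mean texture": [10.38, 17.77],
                   "target": [0, 0]})

X_by_position = df.iloc[:, :-1]        # everything but the last column
X_by_name = df.drop(columns="target")  # same frame, selected by name
assert X_by_position.equals(X_by_name)
```

Dropping by name is a little safer if the target ever stops being the last column.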
The same for the visualization:
viz = dtreeviz(classifier,
               df.iloc[:, :-1],
               df.target,
               target_name='cancer',
               feature_names=df.columns[0:-1],
               class_names=["malignant", "benign"])