Tags: python, scikit-learn, visualization, dtreeviz

Use dtreeviz to visualize a decision tree


I love the decision tree visualisations available from the dtreeviz library (GitHub), and I can reproduce them using:

# Install libraries
!pip install dtreeviz
!apt-get install graphviz

# Sample code
from sklearn.datasets import load_breast_cancer
from sklearn import tree
from dtreeviz.trees import dtreeviz
from IPython.display import display, HTML

classifier = tree.DecisionTreeClassifier(max_depth=4)
cancer = load_breast_cancer()

classifier.fit(cancer.data, cancer.target)
viz = dtreeviz(classifier,
               cancer.data,
               cancer.target,
               target_name='cancer',
               feature_names=cancer.feature_names, 
               class_names=["malignant", "benign"],
               fancy=False) 

display(HTML(viz.svg()))

However, when I apply the above to a decision tree I made myself, the code bombs out because my data is in a pandas DataFrame (or a NumPy array), not a scikit-learn Bunch object.

Now, over at "Scikit-learn - How to create a Bunch object" they tell me pretty sternly not to try to create a Bunch object; but I also do not have the skills to convert my DataFrame or NumPy array into something that the viz function above will accept.

We can suppose my DF has nine features and a target, called 'Feature01', 'Feature02', etc., and 'Target01'.

This I would normally split thusly

FeatDF  = FullDF.drop( columns = ["Target01"])
LabelDF = FullDF["Target01"]

and then go on my merry way to fit a classifier or, for ML, create a train/test split.

None of this is helpful when calling dtreeviz, which expects things like "feature_names" (which I take to be something included in the Bunch object). And since I can't convert my DF to a Bunch, I'm very much stuck. Oh bring your wisdom, please.

Update: I guess any simple DF would illustrate my conundrum. We could just swing with

import pandas as pd

Things = {'Feature01': [3,4,5,0], 
          'Feature02': [4,5,6,0], 
          'Feature03': [1,2,3,8], 
          'Target01': ['Red','Blue','Teal','Red']}
DF = pd.DataFrame(Things,
                  columns= ['Feature01', 'Feature02', 
                            'Feature03', 'Target01']) 

as an example DF. Now, would I then go

DataNP = DF.to_numpy()
classifier.fit(DF.data, DF.target)
feature_names = ['Feature01', 'Feature02', 'Feature03'] 
#..and what if I have 50 features...

viz = dtreeviz(classifier,
               DF.data,
               DF.target,
               target_name='Target01',
               feature_names=feature_names, 
               class_names=["Red", "Blue", "Teal"],
               fancy=False) 

or is this daft? Thanks for the guidance so far!


Solution

    • sklearn's decision tree needs numerical target values
    • You can use sklearn's LabelEncoder to transform your strings to integers

      from sklearn import preprocessing
      
      label_encoder = preprocessing.LabelEncoder()
      label_encoder.fit(df.Target01)
      
      df['target'] = label_encoder.transform(df.Target01)
      
    • dtreeviz expects the class_names to be a list or dict, so let's get it from our label_encoder

      class_names = list(label_encoder.classes_)        
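
    The order of the names matters here. LabelEncoder sorts the classes, so the integer codes follow alphabetical order rather than the order in which the labels first appear. A minimal sketch of that behaviour, assuming only the four colour labels from the question:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Red", "Blue", "Teal", "Red"]   # Target01 values from the question

label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(labels)

# classes_ is sorted, so position i in this list is the name for integer code i.
class_names = label_encoder.classes_.tolist()
print(class_names)         # ['Blue', 'Red', 'Teal']
print(encoded.tolist())    # [1, 0, 2, 1]

# inverse_transform maps the integer codes back to the original strings.
restored = label_encoder.inverse_transform(encoded).tolist()
print(restored)            # ['Red', 'Blue', 'Teal', 'Red']
```

    A hand-written list such as ["Red", "Blue", "Teal"] would therefore attach the wrong names to codes 0 and 1, which is why the list is read from the encoder instead.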
      

    Complete code

    import pandas as pd
    from sklearn import preprocessing, tree
    from dtreeviz.trees import dtreeviz
    
    Things = {'Feature01': [3,4,5,0], 
              'Feature02': [4,5,6,0], 
              'Feature03': [1,2,3,8], 
              'Target01': ['Red','Blue','Teal','Red']}
    df = pd.DataFrame(Things,
                      columns= ['Feature01', 'Feature02', 
                                'Feature03', 'Target01']) 
    
    label_encoder = preprocessing.LabelEncoder()
    label_encoder.fit(df.Target01)
    df['target'] = label_encoder.transform(df.Target01)
    
    classifier = tree.DecisionTreeClassifier()
    classifier.fit(df.iloc[:,:3], df.target)
    
    dtreeviz(classifier,
             df.iloc[:,:3],
             df.target,
             target_name='toy',
             feature_names=df.columns[0:3],
             class_names=list(label_encoder.classes_)
             )
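
    On the "what if I have 50 features" point from the question: the positional slices above can be replaced by selecting columns by name, so nothing has to be typed out by hand. The sketch below is one possible way to do that. The helper name prepare_tree_inputs is hypothetical (it is not part of dtreeviz or scikit-learn), and the block stops short of calling dtreeviz itself.

```python
import pandas as pd
from sklearn import preprocessing, tree


def prepare_tree_inputs(df, target_col):
    """Hypothetical helper: derive the inputs used above from a DataFrame."""
    features = df.drop(columns=[target_col])       # every column except the target
    encoder = preprocessing.LabelEncoder()
    y = encoder.fit_transform(df[target_col])      # string labels -> integer codes
    feature_names = features.columns.tolist()      # works for 3 or 50 features
    class_names = encoder.classes_.tolist()        # names in encoded order
    return features, y, feature_names, class_names


df = pd.DataFrame({'Feature01': [3, 4, 5, 0],
                   'Feature02': [4, 5, 6, 0],
                   'Feature03': [1, 2, 3, 8],
                   'Target01': ['Red', 'Blue', 'Teal', 'Red']})

X, y, feature_names, class_names = prepare_tree_inputs(df, 'Target01')

classifier = tree.DecisionTreeClassifier(random_state=0)
classifier.fit(X, y)

print(feature_names)   # ['Feature01', 'Feature02', 'Feature03']
print(class_names)     # ['Blue', 'Red', 'Teal']
```

    The four returned values line up with the arguments of the dtreeviz call above: X and y as the data, feature_names and class_names as the keyword arguments of the same name.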
    



    Old answer

    Let's use the cancer dataset to create a pandas DataFrame:

    df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
    df['target'] = cancer.target
    

    which gives us the following dataframe.

    mean radius mean texture    mean perimeter  mean area   mean smoothness mean compactness    mean concavity  mean concave points mean symmetry   mean fractal dimension  radius error    texture error   perimeter error area error  smoothness error    compactness error   concavity error concave points error    symmetry error  fractal dimension error worst radius    worst texture   worst perimeter worst area  worst smoothness    worst compactness   worst concavity worst concave points    worst symmetry  worst fractal dimension target
    0   17.99   10.38   122.8   1001.0  0.1184  0.2776  0.3001  0.1471  0.2419  0.07871 1.095   0.9053  8.589   153.4   0.006399    0.04904 0.05373 0.01587 0.03003 0.006193    25.38   17.33   184.6   2019.0  0.1622  0.6656  0.7119  0.2654  0.4601  0.1189  0
    1   20.57   17.77   132.9   1326.0  0.08474 0.07864 0.0869  0.07017 0.1812  0.05667 0.5435  0.7339  3.398   74.08   0.005225    0.01308 0.0186  0.0134  0.01389 0.003532    24.99   23.41   158.8   1956.0  0.1238  0.1866  0.2416  0.186   0.275   0.08902 0
    2   19.69   21.25   130.0   1203.0  0.1096  0.1599  0.1974  0.1279  0.2069  0.05999 0.7456  0.7869  4.585   94.03   0.00615 0.04006 0.03832 0.02058 0.0225  0.004571    23.57   25.53   152.5   1709.0  0.1444  0.4245  0.4504  0.243   0.3613  0.08758 0
    [...]
    568 7.76    24.54   47.92   181.0   0.05263 0.04362 0.0 0.0 0.1587  0.05884 0.3857  1.428   2.548   19.15   0.007189    0.00466 0.0 0.0 0.02676 0.002783    9.456   30.37   59.16   268.6   0.08996 0.06444 0.0 0.0 0.2871  0.07039 1
    

    and for your classifier it can be used in the following way.

    classifier.fit(df.iloc[:,:-1], df.target)
    

    i.e. just take all but the last column as training/input and the target column as the output/target.
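
    As a quick check of that slicing, the sketch below rebuilds the same DataFrame and compares the positional selection with a selection by column name. It assumes, as above, that the target was appended as the last column; selecting by name would be the safer choice if the column order is not guaranteed.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target              # appended last, so it is the final column

by_position = df.iloc[:, :-1]             # all rows, every column but the last
by_name = df.drop(columns=['target'])     # same selection, independent of column order

print(by_position.shape)                  # (569, 30)
print(by_position.equals(by_name))        # True
```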

    The same for the visualization:

    viz = dtreeviz(classifier,
                   df.iloc[:,:-1],
                   df.target,
                   target_name='cancer',
                   feature_names=df.columns[0:-1],
                   class_names=["malignant", "benign"])