Tags: python, pandas, decision-tree, sklearn-pandas

Extracting rules to predict child nodes or probability scores in a Decision Tree


I am relatively new to the Python implementation of decision trees. I am trying to extract rules that predict only the child nodes, and I need them to produce probability scores (not just the final classification) for new data, and possibly to hand the algorithm over to other users. Is there an easy way to do this? I found some solutions at "How to extract the decision rules from scikit-learn decision-tree?". However, when I test them, I do not get all of my child nodes for some reason (my tree is very large and deep). Any advice is appreciated. Thank you.

I have updated the first code in the link above to return node IDs, and it seems to work best with large trees. However, I am having a hard time making it work with pandas DataFrames. Here is an example:

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier

dummy data:

df = pd.DataFrame({'col1':[0,1,2,3],'col2':[3,4,5,6],'dv':[0,1,0,1]})
df
# create decision tree
dt = DecisionTreeClassifier(random_state=0, max_depth=5, min_samples_leaf=1)
dt.fit(df.loc[:, ['col1', 'col2']], df.dv)

from sklearn.tree import _tree

def tree_to_code(tree, feature_names):
    tree_ = tree.tree_
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
        for i in tree_.feature
    ]
    print ("def tree({}):".format(", ".join(feature_names)))

    def recurse(node, depth):
        indent = "  " * depth
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            name = feature_name[node]
            threshold = tree_.threshold[node]
            print ("{}if {} <= {}:".format(indent, name, threshold))
            recurse(tree_.children_left[node], depth + 1)
            print ("{}else:  # if {} > {}".format(indent, name, threshold))
            recurse(tree_.children_right[node], depth + 1)
        else:
            print ("{}return {}".format(indent, node))

    recurse(0, 1)

tree_to_code(dt, df.columns)

The call above prints the following code:

def tree(col1, col2, dv):
  if col2 <= 3.5:
    return 1
  else:  # if col2 > 3.5
    if col1 <= 1.5:
      return 3
    else:  # if col1 > 1.5
      if col1 <= 2.5:
        return 5
      else:  # if col1 > 2.5
        return 6

When I call the generated function as shown below, I get an error saying that one argument is missing. How can I revise the code so that it works on a pandas DataFrame?

tree('col1', 'col2', 'dv_pred')


Solution

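  • Here is a working solution. The generated function takes a single row x and looks up each feature by column name, and only the feature columns (not dv) are passed to tree_to_code.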

    import pandas as pd
    from sklearn.tree import _tree
    from sklearn.tree import DecisionTreeClassifier
    
    df = pd.DataFrame({'col1':[0,1,2,3],'col2':[3,4,5,6],'dv':[0,1,0,1]})
    
    # create decision tree
    dt = DecisionTreeClassifier(random_state=0, max_depth=5, min_samples_leaf=1)
    features = ['col1','col2']
    dt.fit(df.loc[:,features], df.dv)
    
    
    def tree_to_code(tree, feature_names):
        tree_ = tree.tree_
        feature_name = [
            feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
            for i in tree_.feature
        ]
        print ("def tree(x):")
    
        def recurse(node, depth):
            indent = "  " * depth
            if tree_.feature[node] != _tree.TREE_UNDEFINED:
                name = feature_name[node]
                threshold = tree_.threshold[node]
                print ("{}if x['{}'] <= {}:".format(indent, name, threshold))
                recurse(tree_.children_left[node], depth + 1)
                print ("{}else:  # if x['{}'] > {}".format(indent, name, threshold))
                recurse(tree_.children_right[node], depth + 1)
            else:
                print ("{}return {}".format(indent, node))
    
        recurse(0, 1)
    
    tree_to_code(dt,  df[features].columns)
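
The generator above only prints the source of `tree`; nothing named `tree` exists in the session until that printed output is actually run. Below is a minimal sketch of one way to do that without copy and paste, assuming it is acceptable to `exec` source you generated yourself. It captures the printed text with `contextlib.redirect_stdout` and executes it in a dictionary. The names `buffer`, `generated_source` and `namespace` are illustrative only, and the helper is condensed from the one above, with thresholds passed through `float(...)` so they print as plain literals.

```python
import io
from contextlib import redirect_stdout

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, _tree

df = pd.DataFrame({'col1': [0, 1, 2, 3], 'col2': [3, 4, 5, 6], 'dv': [0, 1, 0, 1]})
features = ['col1', 'col2']
dt = DecisionTreeClassifier(random_state=0, max_depth=5, min_samples_leaf=1)
dt.fit(df[features], df['dv'])


def tree_to_code(tree, feature_names):
    # Condensed version of the generator above: prints a function that
    # returns the ID of the leaf node a row ends up in.
    tree_ = tree.tree_
    print("def tree(x):")

    def recurse(node, depth):
        indent = "  " * depth
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            name = feature_names[tree_.feature[node]]
            threshold = float(tree_.threshold[node])  # plain float prints as a literal
            print("{}if x['{}'] <= {!r}:".format(indent, name, threshold))
            recurse(tree_.children_left[node], depth + 1)
            print("{}else:".format(indent))
            recurse(tree_.children_right[node], depth + 1)
        else:
            print("{}return {}".format(indent, node))

    recurse(0, 1)


# Capture the printed source instead of copying it by hand.
buffer = io.StringIO()
with redirect_stdout(buffer):
    tree_to_code(dt, features)
generated_source = buffer.getvalue()

# exec() runs arbitrary code, so only use it on source you generated yourself.
namespace = {}
exec(generated_source, namespace)
tree = namespace['tree']
```

After this, `tree` is an ordinary Python function, and `generated_source` can be written to a `.py` file to share with other users. One caveat: the generated comparisons use Python floats, while scikit-learn converts inputs to 32-bit floats internally, so a value lying extremely close to a threshold could in principle land in a different leaf.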
    

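    Then, once the printed tree function has been defined in your session (for example by pasting the printed output), apply it row by row. Note that it returns the ID of the leaf node each row falls into, not the predicted class:

    df.apply(tree, axis=1)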
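
The question also asks for probability scores rather than node IDs. As a rough sketch, not part of the original answer: the fitted estimator already exposes the leaf reached by each row through `dt.apply`, and the per-node class distribution through `dt.tree_.value`. Normalising each row of `tree_.value` gives class proportions, so a small lookup table from leaf ID to probabilities can be built and shared. The names `node_proba`, `is_leaf` and `leaf_table` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, _tree

df = pd.DataFrame({'col1': [0, 1, 2, 3], 'col2': [3, 4, 5, 6], 'dv': [0, 1, 0, 1]})
features = ['col1', 'col2']
dt = DecisionTreeClassifier(random_state=0, max_depth=5, min_samples_leaf=1)
dt.fit(df[features], df['dv'])

# Leaf node ID reached by each row (the same kind of ID the generated
# tree(x) function returns).
leaf_ids = dt.apply(df[features])

# tree_.value has shape (n_nodes, n_outputs, n_classes). Normalising each
# row yields class proportions whether the stored values are counts or
# already fractions.
values = dt.tree_.value[:, 0, :]
node_proba = values / values.sum(axis=1, keepdims=True)

# Lookup table: leaf node ID -> probability of each class.
is_leaf = dt.tree_.children_left == _tree.TREE_LEAF
leaf_table = pd.DataFrame(
    node_proba[is_leaf],
    index=np.flatnonzero(is_leaf),
    columns=dt.classes_,
)

# Probability scores for each row, looked up by leaf ID.
proba = leaf_table.loc[leaf_ids].to_numpy()
```

For data scored in the same Python session, `dt.predict_proba(df[features])` should give the same numbers directly. The lookup table is mainly useful when the leaf IDs come from the exported `tree(x)` rules rather than from the estimator itself.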