
Resampling leads to strange, non-binary thresholds in a Decision Tree


I first created a Decision Tree (DT) without resampling. The outcome looked, for example, like this: [image: DT BEFORE Resampling] Here, the split thresholds on binary attributes are "<= 0.5", so it is easy to see how to interpret the decision boundary. As a note: binary attributes are those that were originally strings/non-integers and were then converted into dummies with get_dummies.

But when I resample with SMOTE, my tree looks as follows: DT AFTER Resampling (https://i.sstatic.net/YTL0I.png) The decision boundaries are no longer 0.5 but take strange values. It seems that in the dataset newly generated by SMOTE, these columns no longer contain only "0" and "1". As a note: for "Value" attributes, such continuous thresholds are fine, since those attributes are non-binary and were integers from the beginning.

Here is my code for the DT generation (altered):

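import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# `dummies` is the get_dummies-encoded DataFrame, `target1` the target column name
X1 = dummies.drop(target1, axis="columns")  # input variables
Y1 = dummies[target1]  # target variable

X1_train, X1_test, Y1_train, Y1_test = train_test_split(X1, Y1, test_size=0.3)

clf1 = DecisionTreeClassifier(max_depth=3)
clf1 = clf1.fit(X1_train, Y1_train)
params1 = clf1.get_params()

preds1 = clf1.predict(X1_train)

clf1.predict_proba(X1_train)

feature_names1 = list(X1.columns)
clf1.feature_importances_

fig1 = plt.figure(figsize=(25, 20))
# class_names must be a list of strings in ascending order of class label
_ = tree.plot_tree(clf1,
                   feature_names=feature_names1,
                   class_names=["No Issue", "Issue"],
                   filled=True,
                   fontsize=12)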

Here is how the resampling was done in this case:

from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
X_res, Y_res = sm.fit_resample(X1_train, Y1_train)

"X_res" and "Y_res" were then used in place of "X1_train" and "Y1_train" in the code above.

Do you know of any way to handle the issue shown in the DT pictures above?

I would really appreciate some help! Thank you very much, and let me know whether you need additional information.

My first thought would be to replace all values from 0 to 0.5 with 0.5-only and from 0.5 to 1 with 1-only. But my concern is that this may falsify the data.


Solution

  • Note that in the case of binary variables, the value of the decision boundary does not carry any information. The algorithm displays it as 0.5, but it might just as well be any other value from 0 up to (but excluding) 1: the decision rule at that node would still be the same, namely to separate the 0s from the 1s.

    However, after SMOTE resampling, you can't expect those variables to still be binary. SMOTE creates synthetic samples by interpolating between an existing minority sample and one of its nearest neighbors, so a dummy column can take any value between 0 and 1, and the decision boundaries are now meaningful numbers. To interpret them, you do of course have to keep in mind that they do not refer to the original data, but rather to the resampled set.

    See the answer to this similar question for a cautionary note on using SMOTE with discrete variables.
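
Both points can be illustrated with a minimal pure-Python sketch. It is not the imbalanced-learn implementation: the helper names `smote_like_sample` and `split_left` and the toy data are hypothetical, and the interpolation step only mimics the way SMOTE builds a synthetic sample from a minority sample and one neighbor.

```python
import random


def smote_like_sample(x, neighbor, rng):
    """Interpolate between a sample and a neighbor, in the spirit of SMOTE:
    x_new = x + lam * (neighbor - x), with lam drawn uniformly at random."""
    lam = rng.random()
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]


def split_left(values, threshold):
    """Indices sent to the left child by the rule `value <= threshold`."""
    return [i for i, v in enumerate(values) if v <= threshold]


# 1) On a binary dummy column, every threshold in [0, 1) gives the same split.
binary_column = [0, 1, 1, 0, 1]
splits = {t: split_left(binary_column, t) for t in (0.0, 0.25, 0.5, 0.99)}
all_equal = len({tuple(s) for s in splits.values()}) == 1

# 2) Interpolating two samples that differ in their dummy columns
#    (hypothetical rows: two dummies plus one numeric "Value" feature)
#    produces fractional values in those dummy columns.
rng = random.Random(42)
a = [1, 0, 3.0]
b = [0, 1, 5.0]
synthetic = smote_like_sample(a, b, rng)

print("same split for all thresholds:", all_equal)
print("synthetic sample:", synthetic)
```

In the synthetic row, the two dummy columns are no longer 0 or 1, which is why the tree now has to choose a specific threshold between them. If the categorical columns should stay categorical, imbalanced-learn also offers SMOTENC, which is intended for datasets that mix continuous and categorical features; whether it suits this dataset is something to check against its documentation.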