Tags: python, scikit-learn, classification, decision-tree, gini

How to amend the splitting criterion (Gini/entropy) in a decision tree algorithm in Scikit-Learn?


I am working with a decision tree algorithm on a binary classification problem, and the goal is to minimise false positives (i.e. maximise the positive predictive value) of the classification, because the cost of the diagnostic tool is very high.

Is there a way to introduce a weight into the Gini/entropy splitting criterion to penalise false positive misclassifications?

Here, for example, a modified Gini index is given as:

[formula image: modified Gini index with a per-class weight; not preserved in this copy]

Therefore I am wondering if there is any way to implement it in Scikit-learn?
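For intuition, this kind of weighting is effectively what scikit-learn's class_weight / sample_weight machinery already does to the impurity: each sample of class k counts w_k times when the node's class proportions are formed. A minimal sketch of that computation (my own illustration, not the formula from the image above):

import numpy as np

def weighted_gini(y, class_weight):
    # Gini impurity on class-weighted counts: a sample of class k
    # contributes class_weight[k] to the node's class proportions.
    labels, counts = np.unique(y, return_counts=True)
    w = np.array([class_weight.get(k, 1.0) for k in labels])
    p = (w * counts) / np.sum(w * counts)  # weighted class proportions
    return 1.0 - np.sum(p ** 2)

y_node = np.array([0, 0, 0, 1])
print(weighted_gini(y_node, {0: 1, 1: 1}))    # plain Gini: 0.375
print(weighted_gini(y_node, {0: 1, 1: 100}))  # class 1 now dominates the proportions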

EDIT

Playing with class_weight produced the following results:

from sklearn import datasets as dts
from sklearn import tree

iris_data = dts.load_iris()

X, y = iris_data.data, iris_data.target
# take only classes 1 and 2, which are less separable
X = X[y > 0]
y = y[y > 0]
y = y - 1  # relabel to binary {0, 1}

# decision tree with at most two levels and no class weighting
dt = tree.DecisionTreeClassifier(max_depth=2, class_weight=None)

# fit on the first 55 samples and first two features; no train/test split for simplicity
dt.fit(X[:55, :2], y[:55])

Plotting the decision boundary and the tree (blue points are the positive class, 1):

[image: decision boundary and fitted tree, class_weight=None]
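(The images from the original post are not preserved. As a sketch, a similar figure can be produced as follows, continuing the snippet above; DecisionBoundaryDisplay needs scikit-learn >= 1.1 and a model fitted on exactly two features, as here:)

import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))
DecisionBoundaryDisplay.from_estimator(dt, X[:55, :2], ax=ax1, alpha=0.3)
ax1.scatter(X[:55, 0], X[:55, 1], c=y[:55], edgecolor="k")  # colour = class label
tree.plot_tree(dt, filled=True, ax=ax2)  # the fitted two-level tree
plt.show()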

Whereas after heavily up-weighting the positive class (treating it as the more "precious" one):

dt_100 = tree.DecisionTreeClassifier(max_depth=2, class_weight={1:100})

[image: decision boundary and fitted tree, class_weight={1: 100}]
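Fitting the weighted tree the same way makes the shift measurable as well as visible, e.g. via training confusion matrices (a sketch continuing the snippet above):

from sklearn.metrics import confusion_matrix

dt_100.fit(X[:55, :2], y[:55])
print(confusion_matrix(y[:55], dt.predict(X[:55, :2])))      # unweighted tree
print(confusion_matrix(y[:55], dt_100.predict(X[:55, :2])))  # class 1 weighted 100x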


Solution

  • Decision Tree classifiers support the class_weight argument.

    In two-class problems, this can solve your issue exactly: weighting one class up is equivalent to making its misclassification costlier, which is what your modified Gini aims at. Typically it is used for imbalanced problems. For more than two classes, per-class weights cannot encode a full pairwise misclassification-cost matrix (as far as I know), so there it is only an approximation. See the sketch below.
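
To reduce false positives specifically, put the extra weight on the negative class: every negative sample that ends up in a positive leaf then costs more impurity, so splits are pushed towards purer positive leaves. A minimal sketch on a stock binary dataset (the dataset and the weight of 10 are illustrative assumptions, not from the answer):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for cw in (None, {0: 10, 1: 1}):  # up-weight the negative class to punish false positives
    clf = DecisionTreeClassifier(max_depth=2, class_weight=cw, random_state=0)
    clf.fit(X_tr, y_tr)
    tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
    print(cw, "false positives:", fp, "precision:", round(tp / (tp + fp), 3))

The trade-off is fewer false positives at the price of more false negatives; the weight controls where on that curve the tree lands.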