I work with a decision tree algorithm on a binary classification problem, and the goal is to minimise false positives (i.e. maximise the positive predictive value) of the classification, because the cost of the diagnostic tool is very high.
Is there a way to introduce a weight into the Gini / entropy splitting criteria to penalise false positive misclassifications?
Here, for example, a modified Gini index is given as:

Therefore, I am wondering whether there is any way to implement it in scikit-learn.
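For intuition, here is a minimal sketch (my own illustrative function, `weighted_gini`, not a scikit-learn API) of a Gini impurity computed from class-weighted counts. Scaling each sample's contribution by a per-class weight is effectively what scikit-learn's `class_weight` does to the splitting criterion:

```python
import numpy as np

def weighted_gini(y, class_weight):
    """Gini impurity from class-weighted counts (illustrative sketch).

    Each sample contributes class_weight[its class] to the class counts,
    so a heavily weighted class dominates the impurity calculation.
    """
    y = np.asarray(y)
    weights = np.array([class_weight[c] for c in y], dtype=float)
    total = weights.sum()
    impurity = 1.0
    for c in np.unique(y):
        p = weights[y == c].sum() / total
        impurity -= p ** 2
    return impurity

# Unweighted: a 50/50 node has the maximal binary Gini impurity of 0.5
print(weighted_gini([0, 0, 1, 1], {0: 1, 1: 1}))    # 0.5
# Up-weighting class 1 makes this mixed node look almost pure,
# so splits isolating class 1 are favoured
print(weighted_gini([0, 0, 1, 1], {0: 1, 1: 100}))  # ≈ 0.0196
```

The weight of 100 is an arbitrary illustrative choice; in practice it would be tuned against the false-positive rate you can tolerate.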
EDIT
Playing with class_weight produced the following results:
from sklearn import datasets as dts
from sklearn import tree

iris_data = dts.load_iris()
X, y = iris_data.data, iris_data.target

# take only classes 1 and 2, which are less separable
X = X[y > 0]
y = y[y > 0]
y = y - 1  # make binary labels

# decision tree with at most two levels and no class weighting
dt = tree.DecisionTreeClassifier(max_depth=2, class_weight=None)

# fit the model (no train/test split, for simplicity)
dt.fit(X[:55, :2], y[:55])
Plotting the decision boundary and the tree (blue points are positive, class 1):
Whereas up-weighting the minority (the more precious) class:
dt_100 = tree.DecisionTreeClassifier(max_depth=2, class_weight={1:100})
Decision tree classifiers support the class_weight argument. In two-class problems, this can solve your issue exactly. It is typically used for unbalanced problems. For more than two classes, it is not possible to weight the individual (true, predicted) label pairs (as far as I know).
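To make this concrete, here is a small sketch on toy 1-D data (the data, the depth limit, and the weight of 10 are all illustrative choices) showing how class_weight shifts the tree away from false positives:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy 1-D data: x < 0.5 is purely negative; x >= 0.5 is mostly positive
# but contains one negative, so predicting 1 there risks a false positive.
X = np.array([[0.1], [0.2], [0.3], [0.4],
              [0.6], [0.7], [0.8], [0.9]])
y = np.array([0, 0, 0, 0, 1, 1, 0, 1])

# Plain tree: splits at 0.5 and calls everything on the right positive
plain = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

# Heavily penalise mistakes on class 0, i.e. discourage false positives
# (the weight of 10 is an arbitrary illustrative choice)
careful = DecisionTreeClassifier(max_depth=1, random_state=0,
                                 class_weight={0: 10, 1: 1}).fit(X, y)

print(plain.predict([[0.8]]))    # [1] -- a false positive on this point
print(careful.predict([[0.8]]))  # [0] -- the weighted tree holds back
```

The weighted counts change both where the tree splits and which class a mixed leaf predicts, which is why class_weight can trade false positives for false negatives without any change to the criterion code itself.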