python machine-learning scikit-learn regression decision-tree

Custom Criterion for DecisionTreeRegressor in sklearn

I want to use a DecisionTreeRegressor for multi-output regression, but I want to use a different "importance" weight for each output (e.g. predicting y1 accurately is twice as important as predicting y2).

Is there a way of including these weights directly in the DecisionTreeRegressor of sklearn? If not, how can I create a custom MSE criterion with different weights for each output in sklearn?

Solution

I am afraid fit accepts only a single set of weights: sample_weight holds one weight per sample, not one per output. See https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor.fit

More disappointing still, because only one set of weights is allowed, the tree algorithms in sklearn are all written around that single set of weights.
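
To make the limitation concrete, here is a minimal sketch, assuming a reasonably recent scikit-learn and NumPy. The data is synthetic and the variable names are made up for illustration. It fits a two-output tree with a 1D sample_weight, then tries a 2D (per-sample, per-output) weight array, which I would expect fit to reject with a ValueError (the exact message may differ between versions).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic two-output regression problem (illustrative data only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = np.column_stack([X[:, 0] + X[:, 1], X[:, 2] ** 2])

# Supported: one weight per sample, shared by both outputs.
sample_weight = np.ones(50)
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X, Y, sample_weight=sample_weight)
pred_shape = tree.predict(X).shape  # one column per output

# Not supported: one weight per sample *and* per output.
per_output_weights = np.tile([2.0, 1.0], (50, 1))  # shape (50, 2)
try:
    DecisionTreeRegressor().fit(X, Y, sample_weight=per_output_weights)
    rejected = False
except ValueError:
    rejected = True

print(pred_shape, rejected)
```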

As for custom criterion:

There is a similar issue in scikit-learn: https://github.com/scikit-learn/scikit-learn/issues/17436

A potential solution is to create a criterion class that mimics an existing one (e.g. MAE) in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx#L976

However, if you read the code in detail, you will find that every weight-related variable assumes a single set of weights, shared by all outputs (tasks) rather than specific to each one.

So to customize, you may need to hack a lot of code, including:

Hack the fit function to accept a 2D array of weights: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_classes.py#L142
Bypass the input checks (otherwise you have to keep hacking those too).
Modify the tree builder to allow the weights: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L111 This part is painful: there are a lot of related variables, and you would have to change them from double to double*.
Modify the Criterion class to accept a 2D array of weights: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx#L976
In init, reset and update, you have to keep attributes such as self.weighted_n_node_samples specific to each output (task).
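
For reference, the quantity such a criterion would have to compute can be sketched in plain Python. This is not sklearn code and does not plug into DecisionTreeRegressor; the function names weighted_mse_impurity and split_gain are hypothetical. It only illustrates the idea of an MSE impurity where each output's variance is scaled by its own importance weight, which the Cython criterion would need to reproduce incrementally.

```python
def weighted_mse_impurity(y_rows, output_weights):
    """Weighted average, over outputs, of the per-output variance in a node.

    y_rows: list of samples, each a list with one target value per output.
    output_weights: one importance weight per output (hypothetical parameter).
    """
    n = len(y_rows)
    total_weight = sum(output_weights)
    impurity = 0.0
    for k, w_k in enumerate(output_weights):
        column = [row[k] for row in y_rows]
        mean_k = sum(column) / n
        var_k = sum((v - mean_k) ** 2 for v in column) / n
        impurity += w_k * var_k
    return impurity / total_weight


def split_gain(y_left, y_right, output_weights):
    """Impurity decrease obtained by splitting a node into two children."""
    parent = y_left + y_right
    n = len(parent)
    children = (
        len(y_left) * weighted_mse_impurity(y_left, output_weights)
        + len(y_right) * weighted_mse_impurity(y_right, output_weights)
    ) / n
    return weighted_mse_impurity(parent, output_weights) - children


# y1 counts twice as much as y2 when scoring this candidate split.
gain = split_gain([[0.0, 0.0]], [[2.0, 10.0]], output_weights=[2.0, 1.0])
print(gain)
```

With equal weights this reduces to a plain average of the per-output variances, so the hard part is not the formula but threading the per-output weights through the builder and criterion as listed above.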

TBH, I think this is really difficult to implement. It may be worth opening an issue with the scikit-learn maintainers.