
Custom Criterion for DecisionTreeRegressor in sklearn


I want to use a DecisionTreeRegressor for multi-output regression, but I want to use a different "importance" weight for each output (e.g. predicting y1 accurately is twice as important as predicting y2).

Is there a way of including these weights directly in the DecisionTreeRegressor of sklearn? If not, how can I create a custom MSE criterion with different weights for each output in sklearn?


Solution

  • I am afraid fit accepts only a single set of weights: its sample_weight argument holds one weight per sample, not one per output. See https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor.fit

    More disappointingly, because only one weight set is allowed, the tree-building code in sklearn is written entirely around a single weight set.
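
    To make the limitation concrete, here is a small sketch on made-up data. It fits a two-output tree with a 1-D sample_weight, then tries a 2-D weight array of shape (n_samples, n_outputs). The data, variable names and tree settings are illustrative assumptions, and the expectation that the 2-D array is rejected with a ValueError reflects how recent scikit-learn releases validate sample_weight; other versions may behave differently.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
# Two outputs (y1, y2) stacked column-wise: shape (n_samples, n_outputs).
y = np.column_stack([X[:, 0] + X[:, 1], X[:, 2] ** 2])

# Supported: one weight per sample, shape (n_samples,).
per_sample_weight = rng.uniform(0.5, 2.0, size=X.shape[0])
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X, y, sample_weight=per_sample_weight)
pred_shape = tree.predict(X).shape

# Not supported: one weight per sample and per output,
# shape (n_samples, n_outputs). Illustrative weights: y1 counts double.
per_output_weight = np.tile([2.0, 1.0], (X.shape[0], 1))
try:
    DecisionTreeRegressor(max_depth=3).fit(X, y, sample_weight=per_output_weight)
    rejected_2d = False
except ValueError:
    rejected_2d = True
```

    The per-sample weights scale whole rows, so they cannot express "y1 matters twice as much as y2".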

    As for custom criterion:

    There is a similar issue in scikit-learn https://github.com/scikit-learn/scikit-learn/issues/17436

    A potential solution is to create a criterion class that mimics an existing one (e.g. MAE) in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx#L976

    However, if you read the code in detail, you will find that every weight-related variable assumes a single weight set shared by all outputs (tasks), rather than one weight set per output.

    So to customize this, you may need to hack a lot of code, including:

    1. Hack the fit function to accept a 2-D array of weights https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_classes.py#L142

    2. Bypass the input checks (otherwise, keep hacking...)

    3. Modify the tree builder to accept the weights https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L111 This is painful: there are a lot of related variables, and you would have to change double to double*

    4. Modify the Criterion class to accept a 2-D array of weights https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx#L976

    5. In init, reset and update, you have to keep attributes such as self.weighted_n_node_samples specific to each output (task).
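
    As a rough illustration of what steps 4 and 5 would have to compute, here is a pure-NumPy sketch of a node cost with a 2-D weight array, plus a brute-force split search on one feature. This is not scikit-learn code: weighted_sse and best_threshold are hypothetical helpers, and defining the node cost as a weighted sum of squared errors is just one possible choice, assumed here so that per-output weights actually change which split wins.

```python
import numpy as np

def weighted_sse(y, weights):
    """Weighted sum of squared errors of one node (hypothetical cost).

    y, weights: arrays of shape (n_node_samples, n_outputs),
    with strictly positive weights.
    """
    # One weighted sample count per output instead of a single scalar.
    weighted_n = weights.sum(axis=0)
    mean = (weights * y).sum(axis=0) / weighted_n
    return float((weights * (y - mean) ** 2).sum())

def best_threshold(x, y, weights):
    """Exhaustive search for the threshold on one feature that
    minimizes the summed cost of the two child nodes."""
    order = np.argsort(x)
    x, y, weights = x[order], y[order], weights[order]
    best_cost, best_thr = np.inf, None
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no valid threshold between equal feature values
        cost = weighted_sse(y[:i], weights[:i]) + weighted_sse(y[i:], weights[i:])
        if cost < best_cost:
            best_cost, best_thr = cost, (x[i - 1] + x[i]) / 2.0
    return best_thr

# Toy node: y1 jumps between x=1 and x=2, y2 jumps between x=0 and x=1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 1.0]])

favor_y1 = np.tile([2.0, 1.0], (len(x), 1))  # y1 twice as important
favor_y2 = np.tile([1.0, 4.0], (len(x), 1))  # y2 four times as important

thr_y1 = best_threshold(x, y, favor_y1)
thr_y2 = best_threshold(x, y, favor_y2)
```

    On this toy node the chosen threshold follows whichever output carries more weight (1.5 when y1 dominates, 0.5 when y2 dominates), which is the behaviour the modified Criterion would need to reproduce in Cython.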

    TBH, I think it is really difficult to implement. Maybe we need to raise an issue with the scikit-learn team.
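
    For the narrower case in the question (one constant importance weight per output, not a full 2-D weight array), a workaround that avoids touching the Cython code may be worth trying. It is not part of the approach above and assumes the default squared-error criterion: multiplying output k by sqrt(w_k) multiplies its squared error by w_k, so the unmodified tree picks splits as if the outputs were weighted, and dividing the predictions by the same factors brings them back to the original units. The helper names below are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_output_weighted_tree(X, y, output_weights, **tree_params):
    """Fit a tree on targets rescaled by sqrt(weight) per output.

    Assumes the default squared-error criterion and positive weights.
    """
    scale = np.sqrt(np.asarray(output_weights, dtype=float))
    tree = DecisionTreeRegressor(**tree_params)
    tree.fit(X, np.asarray(y, dtype=float) * scale)
    return tree, scale

def predict_output_weighted_tree(tree, scale, X):
    """Predict and undo the per-output rescaling."""
    return tree.predict(X) / scale

# Toy data: y1 jumps between x=1 and x=2, y2 jumps between x=0 and x=1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 1.0]])

# y1 twice as important as y2: the single split follows y1.
tree_a, scale_a = fit_output_weighted_tree(X, y, [2.0, 1.0], max_depth=1)
root_thr_a = tree_a.tree_.threshold[0]
pred_a = predict_output_weighted_tree(tree_a, scale_a, X)

# y2 four times as important as y1: the single split follows y2.
tree_b, scale_b = fit_output_weighted_tree(X, y, [1.0, 4.0], max_depth=1)
root_thr_b = tree_b.tree_.threshold[0]
```

    This only covers weights that are constant per output; weights that vary per sample and per output would still need the changes listed above.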