python, scikit-learn, decision-tree

Is there a difference between sklearn's underlying 'entropy' and 'log_loss' criteria for decision tree classifiers?


I'm implementing a decision tree classifier using sklearn and testing out different criteria, but I can't find any difference between the 'entropy' and 'log_loss' criteria. In sklearn's source code, _classes.py in the tree module maps both of them to the same type in its classifier criteria dict, presumably making them the same operation?

CRITERIA_CLF = {
    "gini": _criterion.Gini,
    "log_loss": _criterion.Entropy,
    "entropy": _criterion.Entropy,
}
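A quick interpreter check (this imports sklearn's private _classes module, so the exact path may change between versions) confirms that both keys point to the very same class object:

# CRITERIA_CLF lives in sklearn's private tree module; this import path
# matches recent releases but is not part of the public API.
from sklearn.tree._classes import CRITERIA_CLF

# "is" tests object identity: both keys resolve to _criterion.Entropy.
print(CRITERIA_CLF["entropy"] is CRITERIA_CLF["log_loss"])  # True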

What is the underlying difference? Running with either 'log_loss' or 'entropy' seems to do the same thing under the hood.


Solution

  • There is no difference between log_loss and entropy in the context of decision tree and random forest algorithms; they are interchangeable names for the same criterion. As the sklearn documentation puts it:

    The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “log_loss” and “entropy” both for the Shannon information gain, see Mathematical formulation. Note: This parameter is tree-specific.

    Checking the mathematical formulas in the sklearn documentation also shows that the two are interchangeable in this context (see the formula and quick check below). Theoretically, the best term to use, in my opinion, is entropy, as described by [Shannon entropy][3]; the term "log_loss" seems a bit out of place in the context of decision trees. It is also notable that in previous versions of sklearn, the only split criteria for decision trees were "entropy" and "gini", with "log_loss" added later as an alias.

    [3]: https://www.sciencedirect.com/topics/engineering/shannon-entropy
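For reference, the "Mathematical formulation" section that the docs point to defines the criterion both names select. For a node $m$ with class proportions $p_{mk}$, the entropy (log loss) criterion is

$$H(Q_m) = -\sum_k p_{mk} \log(p_{mk})$$

i.e. the Shannon entropy of the class distribution at the node.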
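As an empirical sanity check, a minimal sketch along these lines (synthetic data via make_classification; note that the "log_loss" spelling requires scikit-learn >= 1.1) should fit two structurally identical trees:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Small synthetic binary classification problem.
X, y = make_classification(n_samples=200, random_state=0)

# Same data, same seed -- only the criterion name differs.
tree_entropy = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
tree_log_loss = DecisionTreeClassifier(criterion="log_loss", random_state=0).fit(X, y)

# export_text renders the learned tree; identical text means identical trees.
print(export_text(tree_entropy) == export_text(tree_log_loss))  # True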