machine-learning, production-environment, xgboost

Why can't machine learning algorithms like XGBoost be used in a production environment?


I am a data scientist, and I have seen at my workplace that the most complex model used in any major production solution is a random forest.

Why can't machine learning algorithms like XGBoost be used in a production environment? Why is there a need for reproducibility?


Solution

  • I can't speak for everyone, but in most cases you want to have a reason for a decision. You need to be able to convince your clients or your boss that this is the right decision or prediction. If you use neural networks or other black-box models, you only get the resulting prediction and, if you are lucky, a confidence estimate.

    "White-box" models, that is, models which can be interpreted, are better because you can point to specific features of a sample and say that these are the reasons for the resulting prediction. Shallow decision trees and simple thresholding belong to this category.

    If I understand the concept of XGBoost correctly, each new tree is trained to correct the mistakes of the previous ones. This means the trees are not independent: a single tree outputs a correction to the running prediction rather than a prediction of its own, so the ensemble is difficult to interpret.
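
To make the last point concrete, here is a minimal sketch of boosting with squared loss, written in plain Python with one-split "stumps" on a single feature. It is a toy illustration of the general idea, not XGBoost's actual implementation or API; the function names (`fit_stump`, `boost`, `predict`) and the six data points are made up for this example.

```python
# Toy gradient boosting with squared loss on one feature.
# Hypothetical helper names; this is NOT the XGBoost API.

def mean(values):
    return sum(values) / len(values)

def fit_stump(xs, targets):
    """Fit a one-split tree: return (threshold, left_value, right_value)."""
    best = None
    points = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct x values.
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        error = (sum((t - left_value) ** 2 for t in left)
                 + sum((t - right_value) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def boost(xs, ys, n_trees=3):
    """Start from the mean of ys, then fit each stump to the current residuals."""
    base = mean(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        # Each new stump is trained on what the ensemble still gets wrong,
        # not on the original labels.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps

def predict(base, stumps, x):
    # The final prediction is the base value plus every stump's correction.
    return base + sum(predict_stump(s, x) for s in stumps)

def squared_error(base, stumps, xs, ys):
    return sum((y - predict(base, stumps, x)) ** 2 for x, y in zip(xs, ys))

# Made-up data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.0, 3.0, 3.0, 6.0, 6.0]

base, stumps = boost(xs, ys, n_trees=3)
for i, (threshold, left_value, right_value) in enumerate(stumps, start=1):
    print(f"stump {i}: x <= {threshold} -> {left_value:+.3f}, else {right_value:+.3f}")
print("training squared error:", squared_error(base, stumps, xs, ys))
```

Reading any one stump in isolation tells you very little: its leaf values are adjustments (some of them negative, even though every label is positive) that only make sense given everything fitted before it. With hundreds of deeper trees, that dependence is what makes it hard to point at a few features and say "this is why".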