Tags: apache-spark, predict, apache-spark-mllib

Is it possible to obtain class probabilities using GradientBoostedTrees with Spark MLlib?


I am currently working with Spark MLlib.

I have created a text classifier using the Gradient Boosting algorithm with the class GradientBoostedTrees:

Gradient Boosted Trees

Currently I use the predictions to determine the class of new elements, but I would like to obtain the class probabilities (the output value before the hard decision).

In other MLlib algorithms, such as logistic regression, you can remove the threshold from the classifier to obtain the class probabilities, but I cannot find a way to do the same with GradientBoostedTrees.


Solution

  • It seems that Spark MLlib does not offer a way to obtain the class probabilities.

    You can only obtain the final classification decision.

    That's a pity, because that information would be very useful (classifying a sample as positive with 99.99% probability is not the same as doing so with 51%), and it is not difficult to obtain once the model has been trained.

    An alternative is to use different software, such as xgboost: https://github.com/dmlc/xgboost
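
To illustrate why the information is not hard to recover once the model is trained, note that a boosted ensemble's raw score is the weighted sum of its trees' outputs, and the hard decision only keeps the sign of that score. The sketch below is plain Python rather than Spark code, and `gbt_positive_probability`, `tree_predictions` and `tree_weights` are hypothetical names. It assumes a binary model trained with a log loss of the form log(1 + exp(-2 * y * F(x))) with labels y in {-1, +1}, in which case the score F(x) maps to a probability through 1 / (1 + exp(-2 * F(x))). Check which loss your model actually used before relying on this. In the Scala API the per-tree inputs should be available from the model's `trees` and `treeWeights` fields.

```python
import math


def gbt_positive_probability(tree_predictions, tree_weights):
    """Map per-tree outputs of a binary GBT ensemble to P(class = 1).

    Assumption (not stated in the original post): the ensemble was trained
    with log loss log(1 + exp(-2 * y * F(x))), y in {-1, +1}.
    """
    if len(tree_predictions) != len(tree_weights):
        raise ValueError("need exactly one weight per tree")

    # Raw score F(x): weighted sum of the individual tree outputs.
    margin = sum(p * w for p, w in zip(tree_predictions, tree_weights))

    # Numerically stable form of 1 / (1 + exp(-2 * margin)).
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-2.0 * margin))
    e = math.exp(2.0 * margin)
    return e / (1.0 + e)


# Hypothetical per-tree outputs and weights for a single sample.
probability = gbt_positive_probability([0.4, -0.1, 0.3], [1.0, 0.1, 0.1])
print(probability)
```

A probability above 0.5 corresponds to a positive raw score, so the hard decision is unchanged. The extra value only tells you how far the sample sits from the decision boundary.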