I used a random forest to classify texts to certain categories. When I used my testdata I got an accuracy of 0.98. But with another set of data the overall accuracy decreases to 0.7. I think, most of the rows still have a high accuracy.
So now I want to show only the predicted categories with a high confidence. random-forrest gives me a column "probability", which is an array of probabilities. How do I get the actual probabilty of the chosen prediction?
val randomForrest = new RandomForestClassifier()
.setLabelCol(labelIndexer.getOutputCol)
.setFeaturesCol(vectorAssembler.getOutputCol)
.setProbabilityCol("probability")
.setSeed(123)
.setPredictionCol("prediction")
I eventually came up with the following udf to get the best prediction together with its probability. If there is a more convenient way, please comment.
def getBestPrediction = udf((
rawPrediction: org.apache.spark.ml.linalg.Vector, probability: org.apache.spark.ml.linalg.Vector) => {
val bestPrediction = probability.argmax
val bestProbability = probability(bestPrediction)
(bestPrediction, bestProbability)
})