scala, apache-spark, apache-spark-ml

How to Split the Predicted Probabilities Produced by ML Pipeline Logistic Regression


I'm trying to extract the predicted probability from a logistic regression model using the ML Pipeline and DataFrame API. The predicted probabilities come back as a vector column that stores the probability for each class (0, 1), as shown below. How can I extract only the probability for class 1? Thank you!

prob
"[0.13293408418007766,0.8670659158199223]"
"[0.1335112097146626,0.8664887902853374]"


Solution

  • A UDF like this should work:

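    import org.apache.spark.sql.functions.udf
    
    // The probability column holds a vector; index 1 is the probability of class 1.
    // On Spark 2.0+ the column type is org.apache.spark.ml.linalg.Vector instead.
    val getPOne = udf((v: org.apache.spark.mllib.linalg.Vector) => v(1))
    // $"..." needs the session's implicits in scope (already imported in spark-shell).
    model.transform(testDf).select(getPOne($"probability"))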
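
For newer Spark versions, here is a minimal sketch of the same idea, assuming Spark 3.0 or later, where the probability column is an `org.apache.spark.ml.linalg.Vector`. The object name `ExtractClassOneProbability`, the helpers `viaUdf` and `viaArray`, and the output column name `p1` are made up for illustration. The small hand-built `predictions` DataFrame stands in for `model.transform(testDf)` and reuses the two rows from the question.

```scala
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object ExtractClassOneProbability {
  // UDF over the ml.linalg vector type; v(1) is the class-1 probability.
  private val getPOne = udf((v: Vector) => v(1))

  // Option 1: same approach as the answer, with the Spark 2.0+ vector type.
  def viaUdf(df: DataFrame): DataFrame =
    df.withColumn("p1", getPOne(col("probability")))

  // Option 2 (assumes Spark 3.0+): convert the vector to an array column
  // and take element 1, with no UDF involved.
  def viaArray(df: DataFrame): DataFrame =
    df.withColumn("p1", vector_to_array(col("probability")).getItem(1))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("extract-class-one-probability")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for model.transform(testDf): one vector-typed "probability" column.
    val predictions = Seq(
      (0L, Vectors.dense(0.13293408418007766, 0.8670659158199223)),
      (1L, Vectors.dense(0.1335112097146626, 0.8664887902853374))
    ).toDF("id", "probability")

    viaUdf(predictions).select("id", "p1").show(false)
    viaArray(predictions).select("id", "p1").show(false)

    spark.stop()
  }
}
```

Both helpers should yield the same `p1` values. The array-based variant avoids a UDF, which may be preferable when it is available, but it will not compile against Spark versions older than 3.0.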