apache-spark apache-spark-sql apache-spark-ml

Type mismatch error while running ml.PredictionModel in Spark

After training all the models, I am trying to rename each model's prediction column so that every model's predictions can be uniquely identified in the dataset. I am getting the type mismatch error shown below:

import org.apache.spark.ml.PredictionModel

import org.apache.spark.sql.DataFrame

val models = Seq(("NB", nbModel), ("DT", dtModel), ("RF", rfModel), ("GBM",gbmModel))

Its output is:

models: Seq[(String, Any)] = List((NB,NaiveBayesModel (uid=nb_699528805899) with 2 classes), (DT,()), (RF,RandomForestClassificationModel (uid=rfc_403e93000cb6) with 10 trees), (GBM,GBTClassificationModel (uid=gbtc_e778e2781d0b) with 20 trees))

def mlData(inputData: DataFrame, responseColumn: String,
           baseModels: Seq[(String, PredictionModel[_, _])]): DataFrame = {
  baseModels.map { case (name, model) =>
    // Score the input, keep the key and this model's prediction column,
    // and rename that column (whatever it is called) to <name>_prediction.
    model.transform(inputData)
      .select("row_id", model.getPredictionCol)
      .withColumnRenamed(model.getPredictionCol, s"${name}_prediction")
  }.reduceLeft((a, b) => a.join(b, Seq("row_id"), "inner"))
    // Attach the response column back by row_id.
    .join(inputData.select("row_id", responseColumn), Seq("row_id"), "inner")
}

Its output is:

mlData: (inputData: org.apache.spark.sql.DataFrame, responseColumn: String, baseModels: Seq[(String, org.apache.spark.ml.PredictionModel[_, _])]) org.apache.spark.sql.DataFrame

val mlTrainData= mlData(transferData, "value", models).drop("row_id")

I am getting a type mismatch error that I did not expect:

<console>:102: error: type mismatch;
 found   : Seq[(String, Any)]
 required: Seq[(String, org.apache.spark.ml.PredictionModel[_, _])]
       val mlTrainData= mlData(transferData, "value", models).drop("row_id")

Solution

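Just from the output it is clear that the second element of the DT tuple is Unit (it prints as (DT,())), not a PredictionModel. The only common supertype of the model classes and Unit is Any, so the whole sequence is inferred as Seq[(String, Any)] and the call to mlData fails to type-check.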

Since you don't show how dtModel is defined, it is not clear how it ended up as Unit.
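The widening described above can be reproduced without Spark. The following is only a minimal sketch: Model, NaiveBayesStub and RandomForestStub are hypothetical stand-ins for the real Spark classes, and dtModel is set to () directly because the question does not show how it was actually produced. It shows the inferred element type becoming Any, a way to find the offending entries at runtime, and how an explicit type annotation would move the compile error to the line where the sequence is built.

```scala
// Hypothetical stand-ins for Spark model classes, so the sketch runs without Spark.
trait Model { def name: String }
final case class NaiveBayesStub(name: String) extends Model
final case class RandomForestStub(name: String) extends Model

object WideningDemo {
  val nbModel = NaiveBayesStub("nb")
  val rfModel = RandomForestStub("rf")

  // Assumption: dtModel somehow evaluated to Unit (for example, the result of a
  // statement that returns nothing). The real cause is not shown in the question.
  val dtModel = ()

  // No annotation: the compiler silently widens the element type to (String, Any).
  val inferred = Seq(("NB", nbModel), ("DT", dtModel), ("RF", rfModel))

  // Runtime check: names of the entries that are not models at all.
  val badEntries: Seq[String] =
    inferred.collect { case (n, m) if !m.isInstanceOf[Model] => n }

  // Recover only the well-typed entries as Seq[(String, Model)].
  val typed: Seq[(String, Model)] =
    inferred.collect { case (n, m: Model) => (n, m) }

  // With an explicit annotation the mistake is reported where the Seq is built:
  // val strict: Seq[(String, Model)] = Seq(("NB", nbModel), ("DT", dtModel))
  //   -> does not compile: the DT element is Unit, not Model

  def main(args: Array[String]): Unit = {
    println(s"Entries that are not models: $badEntries")
    println(s"Usable models: ${typed.map(_._1)}")
  }
}
```

In the original code, the equivalent of the annotation would be declaring models as Seq[(String, PredictionModel[_, _])], so the compiler points at dtModel instead of at the later mlData call.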