Tags: scala, apache-spark, machine-learning, apache-spark-ml

Scala Spark model transform returns all zeroes


Good day, everyone. To start with, I'm doing a simple machine learning task with apache-spark ml (not mllib) and Scala. My build.sbt is as follows:

name := "spark"
version := "1.0"
scalaVersion := "2.11.11"
libraryDependencies += "org.apache.spark" %% "spark-core"  % "2.1.1"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.1.1"
libraryDependencies += "com.crealytics" %% "spark-excel" % "0.8.2"
libraryDependencies += "com.databricks" %% "spark-csv" % "1.0.1"

All stages run just fine, but there is a problem with the dataset that should contain the predictions. In my case I'm doing classification on three classes; the labels are 1.0, 2.0 and 3.0, but the prediction column contains only 0.0, even though no such label exists at all. Here is the original dataframe:

+--------------------+--------+
|               tfIdf|estimate|
+--------------------+--------+
|(3000,[0,1,8,14,1...|     3.0|
|(3000,[0,1707,223...|     3.0|
|(3000,[1,24,33,64...|     3.0|
|(3000,[1,40,114,5...|     2.0|
|(3000,[1,363,743,...|     2.0|
|(3000,[2,20,65,88...|     3.0|
|(3000,[3,15,21,23...|     3.0|
|(3000,[3,45,53,14...|     3.0|
|(3000,[3,387,433,...|     1.0|
|(3000,[3,523,629,...|     3.0|
+--------------------+--------+

And after classification, my predictions are:

+--------------------+--------+----------+
|               tfIdf|estimate|prediction|
+--------------------+--------+----------+
|(3000,[0,1,8,14,1...|     3.0|       0.0|
|(3000,[0,1707,223...|     3.0|       0.0|
|(3000,[1,24,33,64...|     3.0|       0.0|
|(3000,[1,40,114,5...|     2.0|       0.0|
|(3000,[1,363,743,...|     2.0|       0.0|
|(3000,[2,20,65,88...|     3.0|       0.0|
|(3000,[3,15,21,23...|     3.0|       0.0|
|(3000,[3,45,53,14...|     3.0|       0.0|
|(3000,[3,387,433,...|     1.0|       0.0|
|(3000,[3,523,629,...|     3.0|       0.0|
+--------------------+--------+----------+

And my code is as follows:

 val toDouble = udf[Double, String](_.toDouble)
  val kribrumData = krData.withColumn("estimate", toDouble(krData("estimate")))
    .select($"text",$"estimate")

  kribrumData.cache()

  val tokenizer = new Tokenizer()
    .setInputCol("text")
    .setOutputCol("tokens")
  val stopWordsRemover = new StopWordsRemover()
    .setInputCol("tokens")
    .setOutputCol("filtered")
    .setStopWords(STOP_WORDS)
  val hashingTF = new HashingTF()
    .setInputCol("filtered")
    .setNumFeatures(3000)
    .setOutputCol("tf")
  val idf = new IDF()
    .setInputCol("tf")
    .setOutputCol("tfIdf")
  val preprocessor = new Pipeline()
    .setStages(Array(tokenizer,stopWordsRemover,hashingTF,idf))
  val preprocessor_model = preprocessor.fit(kribrumData)

  val preprocessedKribrumData = preprocessor_model.transform(kribrumData)
    .select("tfIdf", "estimate")

  val Array(train, test) = preprocessedKribrumData.randomSplit(Array(0.8, 0.2), seed = 7)

  test.show(10)

  val logisticRegressor = new LogisticRegression()
    .setMaxIter(10)
    .setRegParam(0.3)
    .setElasticNetParam(0.8)
    .setLabelCol("estimate")
    .setFeaturesCol("tfIdf")
  val classifier = new OneVsRest()
    .setLabelCol("estimate")
    .setFeaturesCol("tfIdf")
    .setClassifier(logisticRegressor)


  val model = classifier.fit(train)

  val predictions = model.transform(test)

  predictions.show(10)

  val evaluator = new MulticlassClassificationEvaluator()
    .setMetricName("accuracy").setLabelCol("estimate")

  val accuracy = evaluator.evaluate(predictions)

  println("Classification accuracy: " + accuracy)

This code eventually yields a prediction accuracy of zero (because there is no label "0.0" in the target column "estimate"). So, what am I actually doing wrong? Any ideas will be greatly appreciated.


Solution

  • Finally I figured out the problem. Spark does not throw an error or exception when the label field is a double but the labels are not in the valid range for the classifier (classifiers expect contiguous 0-based labels). To overcome this, use of StringIndexer is required, so we just need to add to the pipeline:

    val labelIndexer = new StringIndexer()
      .setInputCol("estimate")
      .setOutputCol("indexedLabel")
    

    This step solves the problem, though it is somewhat inconvenient.
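    Putting the fix together, here is a sketch of how the indexer could be wired into the pipeline from the question. It reuses `logisticRegressor` and `preprocessedKribrumData` from the question's code; the column names `indexedLabel` and `predictedEstimate` are just illustrative. Note that the classifier and the evaluator must both be pointed at the indexed label column, and `IndexToString` can map the 0-based predictions back onto the original 1.0/2.0/3.0 labels:

    ```scala
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
    import org.apache.spark.ml.classification.OneVsRest

    // Fit the indexer on the full dataset so it sees every label value
    // and maps them onto the contiguous 0-based range classifiers expect.
    val labelIndexer = new StringIndexer()
      .setInputCol("estimate")
      .setOutputCol("indexedLabel")
      .fit(preprocessedKribrumData)

    // Train on the indexed label instead of the raw one.
    val classifier = new OneVsRest()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("tfIdf")
      .setClassifier(logisticRegressor)

    // Convert indexed predictions back to the original label values,
    // reusing the mapping learned by labelIndexer.
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedEstimate")
      .setLabels(labelIndexer.labels)

    val pipeline = new Pipeline()
      .setStages(Array(labelIndexer, classifier, labelConverter))

    val model = pipeline.fit(train)
    val predictions = model.transform(test)
    ```

    The evaluator should then be given `.setLabelCol("indexedLabel")` so that it compares predictions against the indexed labels rather than the original ones.
    
    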