Tags: apache-spark, dataframe, apache-spark-sql, logistic-regression, libsvm

Spark: How to convert a DataFrame to LibSVM format and perform logistic regression


I'm using this code to get data from Hive into Spark:

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
val MyTab = hc.sql("select * from svm_file")

and I get a DataFrame:

scala> MyTab.show()
+--------------------+
|                line|
+--------------------+
|0 2072:1 8594:1 7...|
|0 8609:3 101617:1...|
|            0 7745:2|
|0 6696:2 9568:21 ...|
|0 200076:1 200065...|
|0 400026:20 6936:...|
|0 7793:2 9221:7 1...|
|0 4831:1 400026:1...|
|0 400011:1 400026...|
|0 200072:1 6936:1...|
|0 200065:29 4831:...|
|1 400026:20 3632:...|
|0 400026:19 6936:...|
|0 190004:1 9041:2...|
|0 190005:1 100120...|
|0 400026:21 6936:...|
|0 190004:1 3116:3...|
|0 1590:12 8594:56...|
|0 3632:2 9240:1 4...|
|1 400011:1 400026...|
+--------------------+
only showing top 20 rows

How can I transform this DataFrame into LibSVM format so I can perform logistic regression, as in this example: https://altiscale.zendesk.com/hc/en-us/articles/202627136-Spark-Shell-Examples ?


Solution

  • I would say don't load it into a DataFrame in the first place and simply use MLUtils.loadLibSVMFile. But if for some reason that is not an option, you can convert to RDD[String] and apply the same map logic that loadLibSVMFile uses:

    import org.apache.spark.sql.Row
    import org.apache.spark.mllib.regression.LabeledPoint
    
    MyTab
      .map{ case Row(line: String) => line }
      .map(_.trim)
      .filter(line => !(line.isEmpty || line.startsWith("#")))
      .map { line => ??? }
    

    In place of ???, just copy and paste the relevant part of the loadLibSVMFile method.
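
    For reference, that relevant part looks roughly like the sketch below (an assumption based on how MLUtils.loadLibSVMFile parses lines, written for Spark 1.x MLlib, where DataFrame.map returns an RDD; on Spark 2.x you would call MyTab.rdd first). Each line is split into a label followed by 1-based index:value pairs, and the feature dimension is inferred as the global maximum index plus one:

    ```scala
    import org.apache.spark.sql.Row
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

    // Parse each line into (label, 0-based indices, values),
    // mirroring the parsing step of MLUtils.loadLibSVMFile
    val parsed = MyTab
      .map { case Row(line: String) => line.trim }
      .filter(line => !(line.isEmpty || line.startsWith("#")))
      .map { line =>
        val items = line.split(' ')
        val label = items.head.toDouble
        val (indices, values) = items.tail.filter(_.nonEmpty).map { item =>
          val Array(i, v) = item.split(':')
          (i.toInt - 1, v.toDouble) // LibSVM indices are 1-based
        }.unzip
        (label, indices, values)
      }

    // Feature dimension = global max index + 1, as loadLibSVMFile infers it
    val numFeatures = parsed.map { case (_, indices, _) =>
      if (indices.isEmpty) 0 else indices.max + 1
    }.reduce(math.max(_, _))

    val data = parsed.map { case (label, indices, values) =>
      LabeledPoint(label, Vectors.sparse(numFeatures, indices, values))
    }

    // Train a binary logistic regression model, as in the linked example
    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(2)
      .run(data)
    ```

    If the same file lived on HDFS instead of Hive, MLUtils.loadLibSVMFile(sc, path) would give you the RDD[LabeledPoint] directly and skip all of the above.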