I'm using this code to get data from Hive into Spark:
val hc = new org.apache.spark.sql.hive.HiveContext(sc)
val MyTab = hc.sql("select * from svm_file")
and I get a DataFrame:
scala> MyTab.show()
+--------------------+
| line|
+--------------------+
|0 2072:1 8594:1 7...|
|0 8609:3 101617:1...|
| 0 7745:2|
|0 6696:2 9568:21 ...|
|0 200076:1 200065...|
|0 400026:20 6936:...|
|0 7793:2 9221:7 1...|
|0 4831:1 400026:1...|
|0 400011:1 400026...|
|0 200072:1 6936:1...|
|0 200065:29 4831:...|
|1 400026:20 3632:...|
|0 400026:19 6936:...|
|0 190004:1 9041:2...|
|0 190005:1 100120...|
|0 400026:21 6936:...|
|0 190004:1 3116:3...|
|0 1590:12 8594:56...|
|0 3632:2 9240:1 4...|
|1 400011:1 400026...|
+--------------------+
only showing top 20 rows
How can I transform this DataFrame into libSVM format so I can run logistic regression, like in this example: https://altiscale.zendesk.com/hc/en-us/articles/202627136-Spark-Shell-Examples ?
I would say: don't load it into a DataFrame in the first place and simply use MLUtils.loadLibSVMFile. But if for some reason that's not an option, you can convert to an RDD[String] and apply the same map logic that loadLibSVMFile uses:
import org.apache.spark.sql.Row
import org.apache.spark.mllib.regression.LabeledPoint

MyTab
  .map { case Row(line: String) => line }                  // extract the raw string from each Row
  .map(_.trim)
  .filter(line => !(line.isEmpty || line.startsWith("#"))) // drop blanks and comment lines
  .map { line => ??? }                                     // parse each line into a LabeledPoint
In place of ???, just copy and paste the relevant part of the loadLibSVMFile method.
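For reference, a sketch of what that parsing logic looks like, based on the MLlib implementation. Note that numFeatures is an assumption here: loadLibSVMFile normally infers it with an extra pass over the data (the maximum index plus one), so you would either hard-code it or compute it yourself before building the vectors.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Parses one libSVM-formatted line, e.g. "0 2072:1 8594:1 7745:2".
// numFeatures must be at least max(index) across the whole dataset.
def parseLibSVMLine(line: String, numFeatures: Int): LabeledPoint = {
  val items = line.split(' ')
  // first token is the label
  val label = items.head.toDouble
  // remaining tokens are index:value pairs
  val (indices, values) = items.tail.filter(_.nonEmpty).map { item =>
    val indexAndValue = item.split(':')
    // libSVM indices are 1-based; MLlib vectors are 0-based
    (indexAndValue(0).toInt - 1, indexAndValue(1).toDouble)
  }.unzip
  LabeledPoint(label, Vectors.sparse(numFeatures, indices, values))
}
```

Plugging this in as `.map { line => parseLibSVMLine(line, numFeatures) }` gives you an RDD[LabeledPoint] that you can feed straight into LogisticRegressionWithSGD, as in the linked example.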
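And if the underlying text file is readable directly from Spark, the whole pipeline collapses to a single call (the path below is hypothetical):

```scala
import org.apache.spark.mllib.util.MLUtils

// Reads a libSVM-formatted text file into an RDD[LabeledPoint],
// inferring the number of features and handling the 1-based indices for you
val data = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/svm_file")
```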