Tags: scala, apache-spark, apache-spark-sql, rdd, apache-spark-mllib

Create LabeledPoint from sparse vector in spark


I created a feature vector in a DataFrame in spark/scala using the VectorAssembler. So far everything works fine. Now I want to create LabeledPoints from the label and the sparse vector.

val labeledPoints = featureDf.map { r =>
  val label = r(0).toString.toDouble + r(1).toString.toDouble + r(2).toString.toDouble
  val features = r(r.size - 1)
  LabeledPoint(label, Vectors.sparse(features))
}

But that doesn't compile. The error is:

overloaded method value sparse with alternatives:
(size: Int,elements: Iterable[(Integer,java.lang.Double)])org.apache.spark.mllib.linalg.Vector
<and>
(size: Int,elements: Seq[(Int, scala.Double)])org.apache.spark.mllib.linalg.Vector
<and>
(size: Int,indices: Array[Int],values:Array[scala.Double])org.apache.spark.mllib.linalg.Vector
cannot be applied to (Any)

I already tried to cast the vector with val features = r(r.size-1).asInstanceOf[Vector] and so on, but nothing works. Does anyone know how to solve this problem?

Thanks in advance!


Solution

  • What you need here is the Row.getAs method, which returns the value with the requested type instead of Any (the error says sparse "cannot be applied to (Any)" because indexing a Row with r(i) returns Any):

    val features = r.getAs[org.apache.spark.mllib.linalg.SparseVector](r.size - 1)
    

    It also supports extraction by name, so assuming your column is called features:

    r.getAs[org.apache.spark.mllib.linalg.SparseVector]("features")
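
Putting it together, the map from the question could look like this. This is a sketch under the question's assumptions: the label is the sum of the first three columns and the VectorAssembler output column is named "features". Note that Vectors.sparse is not needed at all, because getAs already returns a SparseVector, which is a subtype of Vector and can be passed to LabeledPoint directly:

```scala
import org.apache.spark.mllib.linalg.SparseVector
import org.apache.spark.mllib.regression.LabeledPoint

val labeledPoints = featureDf.map { r =>
  // Label as in the question: sum of the first three columns.
  val label = r(0).toString.toDouble + r(1).toString.toDouble + r(2).toString.toDouble
  // getAs gives a typed SparseVector, so no asInstanceOf cast is required.
  val features = r.getAs[SparseVector]("features")
  // SparseVector extends Vector, so it can be used as-is.
  LabeledPoint(label, features)
}
```

If the features column might be dense for some rows, extracting it as org.apache.spark.mllib.linalg.Vector instead of SparseVector avoids a ClassCastException.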