apache-spark, apache-spark-ml

Using VectorAssembler in Spark


I have the following DataFrame (assume it is already a DataFrame):

val df = sc.parallelize(Seq((1, 2, 10), (3, 4, 11), (5, 6, 12)))
           .toDF("a", "b", "c")

and I want to combine some of the columns (not all of them) into one column and turn the result into an RDD[Array[Double]]. I am doing the following:

import org.apache.spark.ml.feature.VectorAssembler
val colSelected = List("a","b")
val assembler = new VectorAssembler()
    .setInputCols(colSelected.toArray)
    .setOutputCol("features")
val output = assembler.transform(df).select("features").rdd

Up to this point everything is fine. Now output is an RDD[spark.sql.Row], and I am unable to transform it into an RDD[Array[Double]]. Is there any way?

I have tried something like the following but with no success:

output.map { case Row(a: Vector[Double]) => a.getAs[Array[Double]]("features")}

Solution

  • The correct solution (this assumes Spark 2.0+; in 1.x use o.a.s.mllib.linalg.Vector):

    import org.apache.spark.ml.linalg.Vector
    
    output.map(_.getAs[Vector]("features").toArray)
    
    • The ml / mllib Vector created by VectorAssembler is not the same as scala.collection.Vector, and it takes no type parameter, so a pattern like Vector[Double] can never match it.
    • Row.getAs should be used with the expected type. It doesn't perform any type conversions, and o.a.s.ml(lib).linalg.Vector is not an Array[Double], hence the .toArray call; see the complete sketch below.
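Putting it all together, here is a minimal runnable sketch assuming Spark 2.x; the SparkSession setup and the names arrays / arrays2 are illustrative, not part of the original snippets:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().appName("assembler-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, 2, 10), (3, 4, 11), (5, 6, 12)).toDF("a", "b", "c")

val colSelected = List("a", "b")
val assembler = new VectorAssembler()
  .setInputCols(colSelected.toArray)
  .setOutputCol("features")

// getAs with the ml Vector type, then toArray to get Array[Double]
val arrays: RDD[Array[Double]] = assembler.transform(df)
  .select("features")
  .rdd
  .map(_.getAs[Vector]("features").toArray)

// An equivalent pattern match also works, provided the matched type
// is the Spark ml Vector, not scala.collection.Vector
val arrays2: RDD[Array[Double]] = assembler.transform(df)
  .select("features")
  .rdd
  .map { case Row(v: Vector) => v.toArray }

arrays.collect().foreach(a => println(a.mkString("[", ", ", "]")))
// [1.0, 2.0]
// [3.0, 4.0]
// [5.0, 6.0]

Note that VectorAssembler casts its numeric input columns to Double when assembling the vector, which is why the integer columns a and b come back as 1.0, 2.0, and so on.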