I got the following dataframe (it is assumed that it is already a dataframe):
val df = sc.parallelize(Seq((1, 2, 10), (3, 4, 11), (5, 6, 12)))
.toDF("a", "b", "c")
and I want to combine the columns(not all) to one column and make it an rdd of Array[Double]. I am doing the following:
import org.apache.spark.ml.feature.VectorAssembler
val colSelected = List("a","b")
val assembler = new VectorAssembler()
.setInputCols(colSelected.toArray)
.setOutputCol("features")
val output = assembler.transform(df).select("features").rdd
Till here it is ok. Now the output is a dataframe of the format RDD[spark.sql.Row]
. I am unable to transform this to a format of RDD[Array[Double]]
. Any way?
I have tried something like the following but with no success:
output.map { case Row(a: Vector[Double]) => a.getAs[Array[Double]]("features")}
The correct solution (this assumes Spark 2.0+, in 1.x use o.a.s.mllib.linalg.Vector
):
import org.apache.spark.ml.linalg.Vector
output.map(_.getAs[Vector]("features").toArray)
ml
/ mllib
Vector
created by VectorAssembler
is not the same as scala.collection.Vector
.Row.getAs
should be used with expected type. It doesn't perform any type conversions and o.a.s.ml(lib).linalg.Vector
is not an Array[Double]
.