Tags: apache-spark, dataframe, apache-spark-sql, apache-spark-mllib, apache-spark-ml

How to convert Spark DataFrame column of sparse vectors to a column of dense vectors?


I used the following code:

df.withColumn("dense_vector", $"sparse_vector".toDense)  

but it gives an error.

I am new to Spark, so this might be obvious and there may be an obvious error in this line. Please help. Thank you!


Solution

  • Contexts which require an operation like this are relatively rare in Spark. With one or two exceptions, the Spark API expects the common Vector class, not a specific implementation (SparseVector, DenseVector). This is also true for the distributed structures from o.a.s.mllib.linalg.distributed:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    
    val df = Seq[(Long, Vector)](
      (1L, Vectors.dense(1, 2, 3)), (2L, Vectors.sparse(3, Array(1), Array(3)))
    ).toDF("id", "v")
    
    new RowMatrix(df.select("v").map(_.getAs[Vector]("v")))
      .columnSimilarities(0.9)
      .entries
      .first
    // org.apache.spark.mllib.linalg.distributed.MatrixEntry = MatrixEntry(0,2,1.0)
    

    Nevertheless, if you really need a column of dense vectors, you can use a UDF like this:

    val asDense = udf((v: Vector) => v.toDense)
    
    df.withColumn("vd", asDense($"v")).show
    // +---+-------------+-------------+
    // | id|            v|           vd|
    // +---+-------------+-------------+
    // |  1|[1.0,2.0,3.0]|[1.0,2.0,3.0]|
    // |  2|(3,[1],[3.0])|[0.0,3.0,0.0]|
    // +---+-------------+-------------+
    

    Just keep in mind that since version 2.0 Spark provides two different and incompatible Vector types:

    • o.a.s.ml.linalg.Vector
    • o.a.s.mllib.linalg.Vector

    each with a corresponding SQL UDT. See MatchError while accessing vector column in Spark 2.0.
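
    For interoperability between the two, Spark 2.x itself ships converters: mllib vectors expose an asML method, and o.a.s.mllib.linalg.Vectors.fromML goes the other way. A minimal sketch (assuming Spark 2.x is on the classpath):

    import org.apache.spark.ml.{linalg => newlinalg}
    import org.apache.spark.mllib.{linalg => oldlinalg}
    
    // Start from a new-style (o.a.s.ml) sparse vector
    val mlVec: newlinalg.Vector = newlinalg.Vectors.sparse(3, Array(1), Array(3.0))
    
    // ml -> mllib
    val mllibVec: oldlinalg.Vector = oldlinalg.Vectors.fromML(mlVec)
    
    // mllib -> ml, then to a dense representation as in the UDF above
    val dense: newlinalg.DenseVector = mllibVec.asML.toDense
    // dense: [0.0,3.0,0.0]

    If you mix the two types in one pipeline (for example, feeding mllib vectors to an ml Transformer), convert explicitly with these methods rather than casting, since the SQL UDTs are distinct.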