I used the following code:
df.withColumn("dense_vector", $"sparse_vector".toDense)
but it gives an error.
I am new to Spark, so this may be an obvious mistake in my code. Please help. Thank you!
Contexts that require an operation like this are relatively rare in Spark. With one or two exceptions, the Spark API expects the common Vector class, not a specific implementation (SparseVector, DenseVector). The same holds for the distributed structures from o.a.s.mllib.linalg.distributed:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val df = Seq[(Long, Vector)](
(1L, Vectors.dense(1, 2, 3)), (2L, Vectors.sparse(3, Array(1), Array(3)))
).toDF("id", "v")
new RowMatrix(df.select("v")
    .rdd.map(_.getAs[Vector]("v")))
  .columnSimilarities(0.9)
  .entries
  .first
// org.apache.spark.mllib.linalg.distributed.MatrixEntry = MatrixEntry(0,2,1.0)
Nevertheless, you could use a UDF like this:
val asDense = udf((v: Vector) => v.toDense)
df.withColumn("vd", asDense($"v")).show
// +---+-------------+-------------+
// | id| v| vd|
// +---+-------------+-------------+
// | 1|[1.0,2.0,3.0]|[1.0,2.0,3.0]|
// | 2|(3,[1],[3.0])|[0.0,3.0,0.0]|
// +---+-------------+-------------+
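The same UDF pattern covers the opposite conversion as well, since Vector also exposes toSparse (a sketch, not part of the original answer; it reuses the df defined above):

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.functions.udf

// Drop explicit zeros and store only the non-zero entries
val asSparse = udf((v: Vector) => v.toSparse)
df.withColumn("vs", asSparse($"v")).show
```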
Just keep in mind that since version 2.0 Spark provides two different and incompatible Vector types:

o.a.s.ml.linalg.Vector
o.a.s.mllib.linalg.Vector

each with its corresponding SQL UDT. See MatchError while accessing vector column in Spark 2.0.
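For the newer o.a.s.ml API, the same approach should work with only the imports changed, a sketch (dfMl and asDenseMl are names introduced here, not from the original answer):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf

val dfMl = Seq(
  (1L, Vectors.dense(1, 2, 3)),
  (2L, Vectors.sparse(3, Array(1), Array(3.0)))
).toDF("id", "v")

// Same UDF pattern, but typed against ml.linalg.Vector
val asDenseMl = udf((v: Vector) => v.toDense)
dfMl.withColumn("vd", asDenseMl($"v")).show
```

Mixing the two packages (an mllib UDF on an ml column, or vice versa) is what triggers the MatchError mentioned above.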