Tags: apache-spark, dataframe, apache-spark-sql, apache-spark-mllib, apache-spark-ml

How to convert Spark DataFrame column of sparse vectors to a column of dense vectors?


I used the following code:

df.withColumn("dense_vector", $"sparse_vector".toDense)  

but it gives an error.

I am new to Spark, so this might be obvious and there may be an obvious error in this line. Please help. Thank you!


Solution

  • Contexts which require an operation like this are relatively rare in Spark. With one or two exceptions, the Spark API expects the common Vector class, not a specific implementation (SparseVector, DenseVector). This is also true for the distributed structures from o.a.s.mllib.linalg.distributed:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    
    val df = Seq[(Long, Vector)](
      (1L, Vectors.dense(1, 2, 3)), (2L, Vectors.sparse(3, Array(1), Array(3)))
    ).toDF("id", "v")
    
    new RowMatrix(df.select("v").map(_.getAs[Vector]("v")))
      .columnSimilarities(0.9)
      .entries
      .first
    // org.apache.spark.mllib.linalg.distributed.MatrixEntry = MatrixEntry(0,2,1.0)
    

    Nevertheless, if you really need a column of dense vectors, you can use a UDF like this:

    val asDense = udf((v: Vector) => v.toDense)
    
    df.withColumn("vd", asDense($"v")).show
    // +---+-------------+-------------+
    // | id|            v|           vd|
    // +---+-------------+-------------+
    // |  1|[1.0,2.0,3.0]|[1.0,2.0,3.0]|
    // |  2|(3,[1],[3.0])|[0.0,3.0,0.0]|
    // +---+-------------+-------------+
    

    Just keep in mind that since version 2.0 Spark provides two different and incompatible Vector types:

    • o.a.s.ml.linalg.Vector
    • o.a.s.mllib.linalg.Vector

    each with a corresponding SQL UDT. See MatchError while accessing vector column in Spark 2.0.
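
    For interoperability between the two, Spark 2.x itself ships converters: mllib vectors expose an asML method, and o.a.s.mllib.linalg.Vectors.fromML goes the other way. A minimal sketch (assuming Spark 2.x is on the classpath):

    import org.apache.spark.ml.{linalg => newlinalg}
    import org.apache.spark.mllib.{linalg => oldlinalg}
    
    // Start from a new-style (o.a.s.ml) sparse vector
    val mlVec: newlinalg.Vector = newlinalg.Vectors.sparse(3, Array(1), Array(3.0))
    
    // ml -> mllib
    val mllibVec: oldlinalg.Vector = oldlinalg.Vectors.fromML(mlVec)
    
    // mllib -> ml, then to a dense representation as in the UDF above
    val dense: newlinalg.DenseVector = mllibVec.asML.toDense
    // dense: [0.0,3.0,0.0]

    If you mix the two types in one pipeline (for example, feeding mllib vectors to an ml Transformer), convert explicitly with these methods rather than casting, since the SQL UDTs are distinct.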