Tags: scala, apache-spark, scala-collections, apache-spark-ml

How to convert scala vector to spark ML vector?


I have a vector of type scala.collection.immutable.Vector and would like to convert it to a vector of type org.apache.spark.ml.linalg.Vector.

For example, I want something like the following:

import org.apache.spark.ml.linalg.Vectors
val scalaVec = Vector(1,2,3)
val sparkVec = Vectors.dense(scalaVec)

Note that I could simply write val sparkVec = Vectors.dense(1,2,3), but I want to convert existing Scala collection Vectors. I want to do this so I can embed these DenseVectors in a DataFrame to feed into spark.ml pipelines.


Solution

  • Vectors.dense can take an array of doubles. What is likely causing your trouble is that Vectors.dense won't accept the Ints you are using in scalaVec in your example. So the following fails:

    import org.apache.spark.ml.linalg.Vectors

    val test = Seq(1,2,3,4,5).to[scala.Vector].toArray
    Vectors.dense(test)

    test: Array[Int] = Array(1, 2, 3, 4, 5)
    <console>:67: error: overloaded method value dense with alternatives:
      (values: Array[Double])org.apache.spark.ml.linalg.Vector <and>
      (firstValue: Double,otherValues: Double*)org.apache.spark.ml.linalg.Vector
     cannot be applied to (Array[Int])
           Vectors.dense(test)


    While this works:

    val testDouble = Seq(1,2,3,4,5).map(x=>x.toDouble).to[scala.Vector].toArray
    Vectors.dense(testDouble)
    
    testDouble: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0)
    res11: org.apache.spark.ml.linalg.Vector = [1.0,2.0,3.0,4.0,5.0]
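    Applied to the scalaVec from the question, a minimal sketch looks like this (the commented DataFrame step assumes a SparkSession named spark is in scope, which is not shown in the question):

    ```scala
    import org.apache.spark.ml.linalg.Vectors

    val scalaVec = Vector(1, 2, 3)

    // Convert the Int elements to Double before handing the array to Vectors.dense,
    // since dense only accepts Array[Double] (or Double varargs)
    val sparkVec = Vectors.dense(scalaVec.map(_.toDouble).toArray)
    // sparkVec: org.apache.spark.ml.linalg.Vector = [1.0,2.0,3.0]

    // To embed such vectors in a DataFrame for a spark.ml pipeline
    // (hypothetical column name "features", assumes `spark` exists):
    // import spark.implicits._
    // val df = Seq(Tuple1(sparkVec)).toDF("features")
    ```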