scala, apache-spark, apache-spark-mllib

Converting a vector column in a dataframe back into an array column


I have a dataframe with two columns, one of which (called dist) is a dense vector. How can I convert it back into an array column of integers?

+---+-----+
| id| dist| 
+---+-----+
|1.0|[2.0]|
|2.0|[4.0]|
|3.0|[6.0]|
|4.0|[8.0]|
+---+-----+

I tried several variants of the following UDF, but it returns a type mismatch error:

val toInt4 = udf[Int, Vector]({ (a) => (a)})  

val result = df.withColumn("dist", toInt4(df("dist"))).select("dist")

Solution

  • I think it's easiest to do it by going to the RDD API and then back.

    import org.apache.spark.mllib.linalg.DenseVector
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.rdd.RDD
    import sqlContext.implicits._  // brings .toDF into scope
    
    // The original data.
    val input: DataFrame =
      sc.parallelize(1 to 4)
        .map(i => i.toDouble -> new DenseVector(Array(i.toDouble * 2)))
        .toDF("id", "dist")
    
    // Turn it into an RDD for manipulation.
    val inputRDD: RDD[(Double, DenseVector)] =
      input.rdd.map(row => row.getAs[Double]("id") -> row.getAs[DenseVector]("dist"))
    
    // Change the DenseVector into an integer array.
    val outputRDD: RDD[(Double, Array[Int])] =
      inputRDD.mapValues(_.toArray.map(_.toInt))
    
    // Go back to a DataFrame.
    val output = outputRDD.toDF("id", "dist")
    output.show
    

    You get:

    +---+----+
    | id|dist|
    +---+----+
    |1.0| [2]|
    |2.0| [4]|
    |3.0| [6]|
    |4.0| [8]|
    +---+----+