scala apache-spark apache-spark-mllib

Convert RDD of Matrix to RDD of Vector


I have an RDD[Matrix[Double]] and want to convert it to an RDD[Vector] (each row in the Matrix becomes a Vector).

I've seen related answers such as "Convert Matrix to RowMatrix in Apache Spark using Scala", but that converts a single Matrix to an RDD of Vector, whereas my case is an RDD of Matrix.


Solution

  • Use flatMap with a function that converts each Matrix to a Seq[Vector]:

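    // Assuming the mllib (not ml) linalg package, as the question's tags suggest.
    import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
    import org.apache.spark.rdd.RDD

    // from https://stackoverflow.com/a/28172826/1206998
    def toSeqOfVector(m: Matrix): Seq[Vector] = {
      // toArray is column-major, so each group of numRows values is one column.
      val columns = m.toArray.grouped(m.numRows)
      val rows = columns.toSeq.transpose // Skip this if you want a column-major RDD.
      rows.map(row => new DenseVector(row.toArray))
    }

    val matrices: RDD[Matrix] = ??? // your input
    // flatMap flattens the per-matrix Seq[Vector] into a single RDD[Vector].
    val vectors:  RDD[Vector] = matrices.flatMap(toSeqOfVector)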
    

    Note: I haven't tested this code, but it shows the principle.
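
    To check the reshaping step without a Spark cluster, here is a minimal sketch in plain Scala. It is not Spark code: `ColMajor`, `toRows`, and `allRows` are hypothetical names made up for illustration, a `Seq` stands in for the RDD, and `Vector` here is Scala's collection type rather than Spark's `Vector`. It assumes the values are stored column-major, as `Matrix.toArray` returns them.

```scala
object RowsSketch {
  // Hypothetical stand-in for a Spark Matrix: column-major values plus a row count.
  final case class ColMajor(numRows: Int, values: Array[Double])

  // Group the flat array into columns of numRows values, then transpose to get rows.
  def toRows(values: Array[Double], numRows: Int): Vector[Vector[Double]] =
    if (numRows == 0) Vector.empty // grouped(0) would throw, so handle it explicitly
    else values.grouped(numRows).map(_.toVector).toVector.transpose

  // A Seq stands in for the RDD: flatMap concatenates the rows of every matrix.
  def allRows(matrices: Seq[ColMajor]): Seq[Vector[Double]] =
    matrices.flatMap(m => toRows(m.values, m.numRows))
}
```

    For example, the 2x3 matrix with rows (1, 3, 5) and (2, 4, 6) is stored column-major as 1, 2, 3, 4, 5, 6, and `RowsSketch.toRows` gives back those two rows. On an RDD the same shape applies: one function per matrix, with flatMap doing the concatenation.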