apache-spark · datastax-enterprise · apache-spark-mllib

Spark Vector RDD creation without collecting to driver


I have an RDD[(User, Item, Count/Rating)] and I would like to convert it into an RDD[Vector(ItemRatings)], where each Vector is the item's ratings in the user space. Is there a way to do this without collecting to the driver first? I am currently using DataStax Enterprise 4.7 with Spark 1.2.1.

Thanks!


Solution

  • Assuming that both User and Item are encoded as Long values, you can use CoordinateMatrix:

    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD
    
    // Build a distributed item-by-user matrix: each (user, item, rating)
    // triple becomes a MatrixEntry(row = item, col = user, value = rating).
    val mat: CoordinateMatrix = new CoordinateMatrix(
      rdd.map { case (user, item, rating) => MatrixEntry(item, user, rating) }
    )
    
    // Each row of the resulting RowMatrix is one item's ratings over all users.
    val vectorRDD: RDD[Vector] = mat.toRowMatrix.rows
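
    For example, with a toy dataset (hypothetical ids and ratings, just to illustrate the shape of the result; `sc` is an existing SparkContext):

    // Hypothetical sample data: (user, item, rating) triples.
    val rdd = sc.parallelize(Seq(
      (0L, 0L, 3.0), (1L, 0L, 5.0),  // item 0 rated by users 0 and 1
      (0L, 1L, 4.0)                  // item 1 rated by user 0 only
    ))
    
    val mat = new CoordinateMatrix(
      rdd.map { case (user, item, rating) => MatrixEntry(item, user, rating) }
    )
    
    // One row per item, each of length numCols (number of users); ratings
    // that are absent from the input are treated as zeros and stored sparsely.
    // Collecting here is only to inspect the tiny example.
    mat.toRowMatrix.rows.collect.foreach(println)

    One caveat: `toRowMatrix` does not guarantee the order of the rows, so you lose track of which item each vector belongs to. If you need that mapping, use `mat.toIndexedRowMatrix` instead, whose `IndexedRow`s carry the item index alongside the vector.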