I have a RDD[(User, Item, Count/Rating)] and I would like to convert it into an RDD[Vector(ItemRatings)] where each Vector is a the item's ratings in the user space. Is there a way to do this without collecting to driver first? I am using Datastax 4.7, Spark 1.2.1 currently.
Thanks!
Assuming that both User
and Item
are encoded as Long
values you can use CoordinateMatrix
.
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
val mat: CoordinateMatrix = new CoordinateMatrix(
rdd.map{case (user, item, rating) => MatrixEntry(item, user, rating)}
)
val vectorRDD: RDD[Vector] = mat.toRowMatrix.rows