My question is equivalent to R-related post Create Sparse Matrix from a data frame, except that I would like to perform the same thing on Spark (preferably in Scala).
Sample of data in the data.txt file from which the sparse matrix is being created:
UserID MovieID Rating
2 1 1
3 2 1
4 2 1
6 2 1
7 2 1
So in the end the columns are the movie IDs and the rows are the user IDs
1 2 3 4 5 6 7
1 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0
3 0 1 0 0 0 0 0
4 0 1 0 0 0 0 0
5 0 0 0 0 0 0 0
6 0 1 0 0 0 0 0
7 0 1 0 0 0 0 0
I've actually started by doing a map
RDD transformation on the data.txt
file (without the headers) to convert values into Integer, but then ... I could not find a function for sparse matrix creation.
val data = sc.textFile("/data/data.txt")
val ratings = data.map(_.split(',') match { case Array(user, item, rate) =>
Rating(user.toInt, item.toInt, rate.toInt)
})
...?
The simplest way is to map Ratings
to MatrixEntries
an create CoordinateMatrix
:
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
val mat = new CoordinateMatrix(ratings.map {
case Rating(user, movie, rating) => MatrixEntry(user, movie, rating)
})
CoordinateMatrix
can be further converted to BlockMatrix
, IndexedRowMatrix
, RowMatrix
using toBlockMatrix
, toIndexedRowMatrix
, toRowMatrix
respectively.