python · pyspark · rdd · similarity

PySpark: Converting RDD to RowMatrix


I have an RDD of (id1, id2, score) tuples. The top(5) rows look like:

[(41955624, 42044497, 3.913625989045223e-06),
(41955624, 42039940, 0.0001018890937469129),
(41955624, 42037797, 7.901647831291928e-05),
(41955624, 42011137, -0.00016191403038589588),
(41955624, 42006663, -0.0005302800991148567)]

I want to calculate similarity between id2 members based on scores. I'd like to use RowMatrix.columnSimilarities, but first I need to convert the RDD to a RowMatrix structured as id1 x id2 -- that is, rows keyed by id1 and columns keyed by id2.

If my data were smaller, I could convert it to a PySpark DataFrame and then use pivot:

rdd_df.groupBy("id1").pivot("id2").sum("score")

but pivot fails with more than 10,000 distinct id2 values, and I have far more than that.
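
(For what it's worth, that 10,000 cap appears to be Spark's spark.sql.pivotMaxValues setting, which can be raised at runtime, assuming a SparkSession named spark:

spark.conf.set("spark.sql.pivotMaxValues", 100000)

but with as many distinct id2 values as I have, a pivot that wide seems impractical anyway.)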

The naive rdd_Mat = la.RowMatrix(rdd) brings the data in as a 3-column matrix, which isn't what I want.

Thanks a lot.


Solution

  • The structure of your data closely resembles that of a CoordinateMatrix, which is essentially a wrapper around an RDD of (long, long, float) tuples. Because of this, you can very easily create a CoordinateMatrix from your existing RDD.

    from pyspark.mllib.linalg.distributed import CoordinateMatrix
    
    cmat = CoordinateMatrix(yourRDD)
    

    Furthermore, since you originally asked for a RowMatrix, PySpark makes it easy to convert between distributed matrix types:

    rmat = cmat.toRowMatrix()
    

    giving you the desired RowMatrix.
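
    Putting it all together, here is a minimal end-to-end sketch, assuming a live SparkContext named sc and the sample tuples from the question; the last step uses RowMatrix.columnSimilarities(), which returns the pairwise cosine similarities between columns as another CoordinateMatrix:

    from pyspark.mllib.linalg.distributed import CoordinateMatrix

    rdd = sc.parallelize([
        (41955624, 42044497, 3.913625989045223e-06),
        (41955624, 42039940, 0.0001018890937469129),
        (41955624, 42037797, 7.901647831291928e-05),
    ])

    cmat = CoordinateMatrix(rdd)      # entries are (row, col, value) tuples
    rmat = cmat.toRowMatrix()         # rows keyed by id1, columns keyed by id2
    sims = rmat.columnSimilarities()  # CoordinateMatrix of column cosine similarities
    print(sims.entries.take(5))

    One caveat: toRowMatrix() uses the raw id2 values as column indices, so the resulting matrix is very wide (columns up to the largest id2) but sparse. If that becomes a problem, you can first remap the ids to contiguous indices (e.g. with zipWithIndex) before building the matrix.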