Search code examples
pythonapache-sparkpysparkapache-spark-mllibrecommendation-engine

Spark - Converting DataFrame to RowMatrix to compute all-pairs similarity efficiently


I have a big DataFrame filled with relations between users and items, like this:

        item1  item2
user1       0      1
user2       1      0

and want to solve the all-pairs similarity problem efficiently.

I saw I could use the columnSimilarities method of the pyspark.mllib module if I were working with a RowMatrix object.

As every method I've come up with to solve this with a DataFrame seems quite inefficient, I'd like to know the best possible way to obtain a RowMatrix from my DataFrame.

Or, in the best case, if I'm missing something and there's a better way to face the all-pairs similarity problem with a DataFrame, I'd love to hear about it.


Solution

  • As mentioned in other answers, there's no way to directly transform a DataFrame into a RowMatrix. You first need to get an RDD object.

    To do so on Python:

    your_rdd = your_dataframe.rdd.map(list)
    your_rowmatrix = RowMatrix(your_rdd)