python apache-spark pyspark apache-spark-mllib recommendation-engine

Spark - Converting DataFrame to RowMatrix to compute all-pairs similarity efficiently

I have a big DataFrame filled with relations between users and items, like this:

        item1  item2
user1       0      1
user2       1      0

and want to solve the all-pairs similarity problem efficiently.

I saw that I could use the columnSimilarities method that pyspark.mllib provides on RowMatrix (in pyspark.mllib.linalg.distributed), if only I were working with a RowMatrix object.
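For concreteness, the quantity in question is the cosine similarity between every pair of columns. Below is a minimal plain-Python sketch on the toy table above; the helper name column_cosine_similarities is hypothetical and not part of Spark, and it only illustrates the math, not how Spark distributes the work.

```python
from math import sqrt


def column_cosine_similarities(rows):
    """Cosine similarity for every column pair (i < j) of equal-length rows."""
    n_cols = len(rows[0])
    # Transpose: one list per column
    cols = [[row[j] for row in rows] for j in range(n_cols)]
    norms = [sqrt(sum(x * x for x in col)) for col in cols]
    sims = {}
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            dot = sum(a * b for a, b in zip(cols[i], cols[j]))
            denom = norms[i] * norms[j]
            # Treat an all-zero column as having similarity 0 with everything
            sims[(i, j)] = dot / denom if denom else 0.0
    return sims


# Rows are users, columns are items, as in the table above
print(column_cosine_similarities([[0, 1], [1, 0]]))  # {(0, 1): 0.0}
```

This brute-force version is quadratic in the number of columns, which is the cost a distributed method is meant to spread out.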

As every method I've come up with to solve this with a DataFrame seems quite inefficient, I'd like to know the best possible way to obtain a RowMatrix from my DataFrame.

Or, in the best case, if I'm missing something and there's a better way to approach the all-pairs similarity problem with a DataFrame, I'd love to hear about it.

Solution

As mentioned in other answers, there's no way to directly transform a DataFrame into a RowMatrix. You first need to get an RDD object.

To do so in Python:

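from pyspark.mllib.linalg.distributed import RowMatrix
# Each Row becomes a plain list, which RowMatrix converts to a vector
your_rdd = your_dataframe.rdd.map(list)
your_rowmatrix = RowMatrix(your_rdd)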