I have a big DataFrame filled with relations between users and items, like this:
item1 item2
user1 0 1
user2 1 0
and want to solve the all-pairs similarity problem efficiently.
I saw I could use the columnSimilarities
method of the pyspark.mllib
module if I were working with a RowMatrix
object.
As every method I've come up with to solve this with a DataFrame
seems quite inefficient, I'd like to know the best possible way to obtain a RowMatrix
from my DataFrame
.
Or, in the best case, if I'm missing something and there's a better way to face the all-pairs similarity problem with a DataFrame
, I'd love to hear about it.
As mentioned in other answers, there's no way to directly transform a DataFrame
into a RowMatrix
. You first need to get an RDD
object.
To do so on Python:
your_rdd = your_dataframe.rdd.map(list)
your_rowmatrix = RowMatrix(your_rdd)