I am completely new to Scala/Spark and I am trying to create from scratch a Spark application that computes the exact Jaccard similarity between n
sets of integers (you don't need to know what it is to answer this question).
I have a Dataframe where each row is a set of integers, for example:
var sets = List(Set(1, 5, 7, 4), Set(3, 5, 0), Set(10, 1, 5)).toDF
and a function jacsim(s1, s2)
that returns the Jaccard similarity between two sets. I want to define a function that given the sets
dataframe returns another dataframe that contains at position (i, j) the result of jacsim(sets(i), sets(j))
. How can I do that?
Additionally: is it a good idea to use the resulting dataframe as a table? I have read that Spark does not "like" rows being accessed by index, as this hinders parallelism. Should I instead return a dataframe with a single row and each possible pair as a new column?
As you mentioned, accessing a Spark dataframe by row index is not supported. Here is one solution using Scala and Spark dataframes:
import org.apache.spark.sql.functions.{monotonically_increasing_id, udf}
import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

val sets = List(Set(1, 5, 7, 4), Set(3, 5, 0), Set(10, 1, 5)).toDF("sets")
  .withColumn("i", monotonically_increasing_id()) // create an index column

// Dummy function; replace it with your implementation of Jaccard similarity.
// Note that Spark hands set columns to a UDF as Seq[Int].
val jaccardSimUDF = udf((set1: Seq[Int], set2: Seq[Int]) => set1.sum + set2.sum)

val resDF = sets.crossJoin(sets.withColumnRenamed("sets", "sets2").withColumnRenamed("i", "j"))
  .withColumn("jaccardSim", jaccardSimUDF($"sets", $"sets2"))
Basically we need to cross join your dataframe against itself to get all combinations. Then we apply a user-defined function (UDF) to compute the Jaccard similarity for each pair. Note that I added index columns for convenience.
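For reference, here is one possible plain-Scala body for that UDF, in case you do not already have jacsim wired up (a sketch, not necessarily how you defined yours):

```scala
// Exact Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|.
// Defined to be 0.0 when both sets are empty, to avoid division by zero.
def jaccard(s1: Set[Int], s2: Set[Int]): Double = {
  val unionSize = (s1 union s2).size
  if (unionSize == 0) 0.0 else (s1 intersect s2).size.toDouble / unionSize
}

// To plug it into the pipeline above, wrap it in a UDF; since Spark passes
// array columns as Seq, convert back to Set first:
//   val jaccardSimUDF = udf((a: Seq[Int], b: Seq[Int]) => jaccard(a.toSet, b.toSet))
```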
Now if you really want a matrix you will need to reshape this dataframe, but that goes against the spirit of Spark.
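If you do want the matrix shape anyway, one way to sketch it (assuming the resDF from above) is to pivot on j, giving one row per i and one column per j:

```scala
import org.apache.spark.sql.functions.first

// Reshape the long (i, j, jaccardSim) rows into a wide "matrix" layout.
// There is exactly one jaccardSim value per (i, j), so first() just picks it.
val matrixDF = resDF
  .groupBy("i")
  .pivot("j")
  .agg(first("jaccardSim"))
  .orderBy("i")
```

Keep in mind that pivot materializes one column per distinct j, so this only makes sense for a small number of sets.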
As pointed out in the comments, the Jaccard similarity function is symmetric, so you can filter out the redundant pairs, like this:
val resDF = sets.crossJoin(sets.withColumnRenamed("sets", "sets2").withColumnRenamed("i", "j"))
.filter($"i" < $"j")
.withColumn("jaccardSim", jaccardSimUDF($"sets", $"sets2"))
It may look wasteful because it is still written as a full cross join, but Spark evaluates lazily, and the Catalyst optimizer can push the filter down into the join, so fewer rows are actually materialized than a literal full cross join would suggest. I don't think there is a fundamentally better approach for exact pairwise similarities.
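An equivalent formulation (a sketch, assuming the same sets dataframe and jaccardSimUDF as above) is to express the predicate as the join condition itself, so Catalyst sees it up front rather than as a post-join filter:

```scala
// Join with the i < j predicate as the join condition instead of a
// separate filter step; the result is the same set of upper-triangle pairs.
val resDFUpper = sets.join(
  sets.withColumnRenamed("sets", "sets2").withColumnRenamed("i", "j"),
  $"i" < $"j"
).withColumn("jaccardSim", jaccardSimUDF($"sets", $"sets2"))
```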