
Why do mapped pairs get obliterated?

I'm trying to understand the example here which computes Jaccard similarity between pairs of vectors in a matrix.

val aBinary = adjacencyMatrix.binarizeAs[Double]

// intersectMat holds the size of the intersection of row(a)_i n row (b)_j
val intersectMat = aBinary * aBinary.transpose
val aSumVct = aBinary.sumColVectors
val bSumVct = aBinary.sumRowVectors

//Using zip to repeat the row and column vectors values on the right hand
//for all non-zeroes on the left hand matrix
val xMat = intersectMat.zip(aSumVct).mapValues( pair => pair._2 )
val yMat = intersectMat.zip(bSumVct).mapValues( pair => pair._2 )

Why does the last comment mention non-zero values? As far as I'm aware, the ._2 function selects the second element of a pair independent of the first element. At what point are (0, x) pairs obliterated?


  • Yeah, I don't know anything about Scalding, but this seems odd. If you look at the zip implementation, it specifically mentions that it does an outer join to preserve zeros on either side. So the comment does not seem to match how zeroes are actually treated by zip.

    Besides, looking at the dimensions returned by zip, it really seems this line just replicates the aSumVct column vector for each column:

    val xMat = intersectMat.zip(aSumVct).mapValues( pair => pair._2 )

    Also, I find val bSumVct = aBinary.sumRowVectors suspicious, because it seems to sum the matrix along the wrong dimension. Something like this feels like it would be better:

    val bSumVct = aBinary.transpose.sumRowVectors

    That would conceptually be the same as aSumVct.transpose, so that in the end cell (i, j) of xMat + yMat holds the sum of the elements of row(i) plus the sum of the elements of row(j). We then subtract intersectMat to correct for the double counting.

    Edit: a little bit of googling unearthed this blog post: It seems the comments were related to that version where the vectors to compare are in two separate matrices that don't necessarily have the same size.
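To make the arithmetic described in the answer concrete, here is a minimal sketch in plain Scala, without Scalding. It assumes a small dense 0/1 matrix stored as nested Vectors. The object JaccardSketch and the function jaccard are hypothetical names used only for illustration, and returning 0.0 when the union is empty is an assumed convention, not something taken from the original example.

```scala
object JaccardSketch {
  // Hypothetical helper, not part of Scalding: each row is a 0/1 vector.
  def jaccard(binary: Vector[Vector[Double]]): Vector[Vector[Double]] = {
    val n = binary.length
    // rowSums(i) = number of ones in row i (the role the answer assigns to aSumVct)
    val rowSums = binary.map(_.sum)
    Vector.tabulate(n, n) { (i, j) =>
      // Dot product of row i and row j, i.e. cell (i, j) of aBinary * aBinary.transpose
      val intersect = binary(i).zip(binary(j)).map { case (a, b) => a * b }.sum
      // |row i| + |row j| counts shared ones twice, so subtract the intersection once
      val union = rowSums(i) + rowSums(j) - intersect
      // Assumed convention: two empty rows get similarity 0.0
      if (union == 0.0) 0.0 else intersect / union
    }
  }
}
```

For the rows (1, 1, 0) and (0, 1, 1), the intersection is 1 and the union is 2 + 2 - 1 = 3, so the similarity is 1/3. This only mirrors the xMat + yMat - intersectMat step; it says nothing about how Scalding's zip handles zero entries.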