
PySpark - Getting Jaccard similarity from a co-occurrence matrix


I have a co-occurrence matrix in PySpark for co-occurrences of certain keywords A, B, C:

    A   B   C
A   5   1   0
B   1   3   2
C   0   2   3

How can I calculate the Jaccard similarity from this matrix in Python for all keywords? Is there a library available to do that, or should I simply compute the similarity using the Jaccard similarity formula?


Solution

  • Let's assume the co-occurrence matrix is given as a list of lists:

    com = [[5, 1, 0], 
           [1, 3, 2],
           [0, 2, 3]]
    n_elem = len(com)
    

    The Jaccard similarity of two sets A and B is given by |A ∩ B| / |A ∪ B|. The co-occurrence matrix gives the values of |A|, |B|, and |A ∩ B| directly: the diagonal entries are the set sizes, and the off-diagonal entries are the intersections. The value of |A ∪ B| is simply |A| + |B| - |A ∩ B|, from which we can compute the Jaccard index.
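
    To make the formula concrete, here is the same computation done directly on two Python sets (the set members are made up for illustration; only the sizes match the A and B entries of the matrix above):

    A = {"w1", "w2", "w3", "w4", "w5"}   # |A| = 5
    B = {"w1", "w6", "w7"}               # |B| = 3, |A ∩ B| = 1
    print(len(A & B) / len(A | B))       # |A ∪ B| = 7, so 1/7 ≈ 0.14285714285714285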

    First, let's create a list of lists of ones with the same shape as com. The default value is 1 because the Jaccard similarity of a set with itself is 1, so the diagonal elements never need to be computed:

    similarity = [[1 for _ in row] for row in com]
    

    Now, we can loop over each pair of values in com and calculate the similarities. The inner loop starts at i+1 because similarity[i][j] is identical to similarity[j][i], so we only need to calculate the upper triangle of the matrix:

    for i in range(n_elem):
        a = com[i][i]               # |A|
        for j in range(i+1, n_elem):
            b = com[j][j]           # |B|
            aib = com[i][j]         # |A ∩ B|
            aub = a + b - aib       # |A ∪ B|
            # Set both off-diagonal elements simultaneously
            similarity[i][j] = similarity[j][i] = aib / aub
    

    This leaves us with the following similarity matrix:

    [[1                  , 0.14285714285714285, 0.0], 
     [0.14285714285714285, 1                  , 0.5], 
     [0.0                , 0.5                , 1]]
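
    If you'd rather look up pairs by keyword than by index, it's easy to re-key the result (the keyword list below is taken from the question):

    keywords = ["A", "B", "C"]
    sim_by_pair = {
        (keywords[i], keywords[j]): similarity[i][j]
        for i in range(n_elem)
        for j in range(n_elem)
    }
    sim_by_pair[("B", "C")]  # 0.5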
    

    Now, if your co-occurrence matrix is a numpy array (or you're open to using numpy), you can speed up this computation by outsourcing the loops to numpy's C backend.

    import numpy as np
    com_arr = np.array([[5, 1, 0],
                        [1, 3, 2],
                        [0, 2, 3]])
    n_elem = com_arr.shape[0]  # number of keywords (rows), not com_arr.size
    

    First, we can get the occurrence count of each keyword from the diagonal of the matrix:

    occ = np.diag(com_arr) # array([5, 3, 3])
    

    Next, create the matrix of |A ∪ B|. Remember that |A ∩ B| is already specified by com_arr:

    aub = occ[:, None] + occ[None, :] - com_arr
    

    Since occ is a 1-d array, indexing with None inserts a new axis: occ[:, None] is a column vector of shape (3, 1), and occ[None, :] is a row vector of shape (1, 3). When adding a row vector to a column vector, numpy broadcasts the dimensions so that you end up with a square matrix, in this case of shape (3, 3). Now, aub looks like this:

    array([[5, 7, 8],
           [7, 3, 4],
           [8, 4, 3]])
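
    If the broadcasting step feels opaque, it can help to inspect the two intermediate operands separately (the variable names are just for illustration):

    col = occ[:, None]   # shape (3, 1): [[5], [3], [3]]
    row = occ[None, :]   # shape (1, 3): [[5, 3, 3]]
    col + row            # shape (3, 3): element [i, j] is occ[i] + occ[j]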
    

    Finally, divide the intersection by the union:

    similarity = com_arr / aub
    

    Et voilà, we have the same values as before:

    array([[1.        , 0.14285714, 0.        ],
           [0.14285714, 1.        , 0.5       ],
           [0.        , 0.5       , 1.        ]])
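
    Finally, since the question mentions PySpark: as far as I know there is no built-in helper that turns a co-occurrence matrix into Jaccard similarities, but if the matrix is small enough to fit on the driver, you can collect it and reuse the numpy recipe above. A minimal sketch, assuming the counts live in a DataFrame with one row per keyword (the DataFrame layout and column names are hypothetical):

    from pyspark.sql import SparkSession
    import numpy as np

    spark = SparkSession.builder.getOrCreate()

    keywords = ["A", "B", "C"]
    rows = [("A", 5, 1, 0), ("B", 1, 3, 2), ("C", 0, 2, 3)]
    df = spark.createDataFrame(rows, ["keyword"] + keywords)

    # Collect the (small) matrix to the driver and apply the numpy recipe
    com_arr = np.array(df.select(*keywords).collect(), dtype=float)
    occ = np.diag(com_arr)
    similarity = com_arr / (occ[:, None] + occ[None, :] - com_arr)

    For a vocabulary too large to collect, you would instead keep the counts in Spark as one row per keyword pair and compute aib / (a + b - aib) with joins against the diagonal entries.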