Search code examples
pythonperformancenumpysimilarity

What can I do to improve sklearn's Jaccard similarity score performance on 9000+ data


I am trying create a table of Jaccard similarity score on a list of vectors x with every other elements in the list that has over 9000 rows (so resulting to a roughly 9000, 9000 list):

[[  2   2  67   2   5   3  62  68  27]
[  2   9  67   2   1   3  20  62 139]
[  2  17  67   2   0   6   6  62  73]
[  2  17  67   2   0   6  39  68  92]
[  0   0  67   0   0   3  62  62  13]
...

I'm a beginner so I tried to my implement shameful excuse of a code like this:

similarities_matrix = np.empty([len(x), len(x)])
for icounter, i in enumerate(x.as_matrix()):
    similarities_row = np.empty(len(x))
    for jcounter, j in enumerate(x.as_matrix()):
        similarities_row[jcounter] = jaccard_similarity_score(i, j)
    similarities_matrix[icounter] = similarities_row
pprint(similarities_matrix) 

But it runs impossibly slow. Ideally I wanted my code to run within the span of my lifetime (preferably under 5 minutes.)

Currently, this code runs roughly a second per element to compute the similarity matrix.


Solution

  • If you don't mind using scipy, you can use the function pdist from scipy.spatial.distance. The value computed by sklearn.metrics.jaccard_similarity_score(u, v) is equivalent to 1 -scipy.spatial.distance.hamming(u, v). For example,

    In [71]: from sklearn.metrics import jaccard_similarity_score
    
    In [72]: from scipy.spatial.distance import hamming
    
    In [73]: u = [2, 1, 3, 5]
    
    In [74]: v = [2, 1, 4, 5]
    
    In [75]: jaccard_similarity_score(u, v)
    Out[75]: 0.75
    
    In [76]: 1 - hamming(u, v)
    Out[76]: 0.75
    

    'hamming' is one of the metrics provided by scipy.spatial.distance.pdist, so you can use that function to compute all the pairwise distances. Here's a small x to use as an example:

    In [77]: x = np.random.randint(0, 5, size=(8, 10))
    
    In [78]: x
    Out[78]: 
    array([[4, 2, 2, 3, 1, 2, 0, 0, 4, 0],
           [3, 1, 4, 2, 3, 1, 2, 3, 4, 4],
           [1, 1, 0, 1, 0, 2, 0, 3, 3, 4],
           [2, 3, 3, 3, 1, 2, 3, 2, 1, 2],
           [3, 2, 3, 2, 0, 0, 4, 4, 3, 4],
           [3, 0, 1, 0, 4, 2, 0, 2, 1, 0],
           [4, 3, 2, 4, 1, 2, 3, 3, 2, 4],
           [3, 0, 4, 1, 3, 3, 3, 3, 1, 3]])
    

    I'll use squareform to convert the output of pdist to the symmetric array of similarities.

    In [79]: from scipy.spatial.distance import pdist, squareform
    
    In [80]: squareform(1 - pdist(x, metric='hamming'))
    Out[80]: 
    array([[ 1. ,  0.1,  0.2,  0.3,  0.1,  0.3,  0.4,  0. ],
           [ 0.1,  1. ,  0.3,  0. ,  0.3,  0.1,  0.2,  0.4],
           [ 0.2,  0.3,  1. ,  0.1,  0.3,  0.2,  0.3,  0.2],
           [ 0.3,  0. ,  0.1,  1. ,  0.1,  0.3,  0.4,  0.2],
           [ 0.1,  0.3,  0.3,  0.1,  1. ,  0.1,  0.1,  0.1],
           [ 0.3,  0.1,  0.2,  0.3,  0.1,  1. ,  0.1,  0.3],
           [ 0.4,  0.2,  0.3,  0.4,  0.1,  0.1,  1. ,  0.2],
           [ 0. ,  0.4,  0.2,  0.2,  0.1,  0.3,  0.2,  1. ]])
    

    I converted your code to this function:

    def jaccard_sim_matrix(x):
        similarities_matrix = np.empty([len(x), len(x)])
        for icounter, i in enumerate(x):
            similarities_row = np.empty(len(x))
            for jcounter, j in enumerate(x):
                similarities_row[jcounter] = jaccard_similarity_score(i, j)
            similarities_matrix[icounter] = similarities_row
        return similarities_matrix 
    

    so we can verify that the pdist result is the same as your calculation.

    In [81]: jaccard_sim_matrix(x)
    Out[81]: 
    array([[ 1. ,  0.1,  0.2,  0.3,  0.1,  0.3,  0.4,  0. ],
           [ 0.1,  1. ,  0.3,  0. ,  0.3,  0.1,  0.2,  0.4],
           [ 0.2,  0.3,  1. ,  0.1,  0.3,  0.2,  0.3,  0.2],
           [ 0.3,  0. ,  0.1,  1. ,  0.1,  0.3,  0.4,  0.2],
           [ 0.1,  0.3,  0.3,  0.1,  1. ,  0.1,  0.1,  0.1],
           [ 0.3,  0.1,  0.2,  0.3,  0.1,  1. ,  0.1,  0.3],
           [ 0.4,  0.2,  0.3,  0.4,  0.1,  0.1,  1. ,  0.2],
           [ 0. ,  0.4,  0.2,  0.2,  0.1,  0.3,  0.2,  1. ]])
    

    Here I'll compare the timing for a larger array:

    In [82]: x = np.random.randint(0, 5, size=(500, 10))
    
    In [83]: %timeit jaccard_sim_matrix(x)
    14.9 s ± 192 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [84]: %timeit squareform(1 - pdist(x, metric='hamming'))
    1.19 ms ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    Finally, let's time the calculation for an input with shape (9000, 10):

    In [94]: x = np.random.randint(0, 5, size=(9000, 10))
    
    In [95]: %timeit squareform(1 - pdist(x, metric='hamming'))
    1.34 s ± 9.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    That's just 1.34 seconds--definitely within the span of a lifetime.