Tags: python-3.x, nlp

Python: How to compute Jaccard Similarity more quickly


There are about 98,000 sentences (5–100 words each) in lst_train and about 1,000 sentences (also 5–100 words) in lst_test. For each sentence in lst_test I want to determine whether it is plagiarized from a sentence in lst_train. If it is, I should return the matching id in lst_train; otherwise null.

Now I want to compute the Jaccard similarity of each sentence in lst_test against each sentence in lst_train. Here's my code; b.JaccardSim computes the Jaccard similarity of two sentences:

lst_all_p = []  # one list of similarity scores per test sentence
for i, test_sentence in enumerate(lst_test):
    print('i:', i)
    lst_p = []
    for train_sentence in lst_train:
        b = textSimilarity.TextSimilarity(test_sentence, train_sentence)
        lst_p.append(b.JaccardSim(b.str_a, b.str_b))
    lst_all_p.append(lst_p)

But I found that comparing one test sentence against every sentence in lst_train takes more than a minute. Since there are about 1,000 test sentences, the whole run would take around 1,000 minutes, which is far too long.

Does anyone know how to speed up the computation, or a better method for detecting whether a sentence is plagiarized from one in lst_train?


Solution

  • It's probably better to change your approach. Jaccard similarity itself isn't computationally intensive, but if you compare every test sentence against every element in your dataset, any non-trivial similarity calculation will be slow.

    If you want to find plagiarism, you should look into near-duplicate detection and locality-sensitive hashing. MinHash and SimHash are good starting points, and the datasketch library may also be helpful.

    Note that for many applications, vectorizing your sentences and searching for nearby sentences in vector space is effective. However, that works because it captures synonyms; since plagiarism is about exact copies, word-vector-based methods could end up working against your goal.
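    To make the MinHash + LSH idea concrete, here is a minimal pure-Python sketch (the datasketch library provides a production-quality version of the same pattern with better hash families; the names build_index and query below are illustrative, not from any library). Each sentence gets a fixed-length MinHash signature; signatures are split into bands, and sentences sharing any band bucket become candidates, so each test sentence is only compared against a handful of training sentences instead of all 98,000:

    ```python
    import zlib
    from collections import defaultdict

    NUM_PERM = 64            # signature length; more "permutations" = better estimate
    BANDS = 16               # 16 bands of 4 rows each (64 = 16 * 4)
    ROWS = NUM_PERM // BANDS

    def tokens(sentence):
        """Lowercased word set; a real system might use character shingles instead."""
        return set(sentence.lower().split())

    def minhash(token_set):
        """Approximate a random permutation per slot with a salted 32-bit hash."""
        return tuple(
            min(zlib.crc32(f"{i}|{t}".encode()) for t in token_set)
            for i in range(NUM_PERM)
        )

    def build_index(train_sentences):
        """Bucket each band of each training signature; return index and signatures."""
        index = defaultdict(set)
        sigs = []
        for idx, s in enumerate(train_sentences):
            sig = minhash(tokens(s))
            sigs.append(sig)
            for b in range(BANDS):
                index[(b, sig[b * ROWS:(b + 1) * ROWS])].add(idx)
        return index, sigs

    def query(sentence, index, sigs, threshold=0.8):
        """Ids of training sentences whose estimated Jaccard >= threshold."""
        sig = minhash(tokens(sentence))
        candidates = set()
        for b in range(BANDS):
            candidates |= index.get((b, sig[b * ROWS:(b + 1) * ROWS]), set())
        # verify only the candidates, using the full signatures
        return [
            i for i in candidates
            if sum(a == b for a, b in zip(sig, sigs[i])) / NUM_PERM >= threshold
        ]
    ```

    Usage: build the index over lst_train once, then call query for each test sentence; an exact or near-exact copy shares many band buckets and is returned, while unrelated sentences rarely collide and are filtered out by the verification step.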