Tags: python, string-comparison, large-data

large-scale string comparison


I'm trying to find similar string pairs in a large set of strings. I have around 100 million strings, and similarity is measured by edit distance. For example, "this is a sentence" and "this is also a sentence" are similar.

It is not practical to calculate the similarity between every pair of strings, which would require on the order of 100M x 100M calculations. I'm considering a divide-and-conquer strategy: first group the strings into "roughly similar" subsets, then compare each string pair within a subset. For example, say I have the following 5 strings,

str1 = "this is a sentence"
str2 = "this is also a sentence"
str3 = "Mary loves elephants"
str4 = "Mary loves an elephant"
str5 = "Mark loves elephants"

I hope to get one subset [str1, str2] and another subset [str3, str4, str5]. Then I'll compare str1 and str2 to see whether they are similar, and compare str3, str4, str5 pairwise to find similar pairs. The total number of comparisons drops from C^2_5 = 10 to C^2_2 + C^2_3 = 4.
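For concreteness, here is a minimal sketch of that second step (comparing pairs within each subset), assuming the python-Levenshtein package and a hand-picked distance threshold; the subsets are hard-coded to stand in for whatever fast grouping method is used.

    from itertools import combinations
    import Levenshtein as L  # pip install python-Levenshtein (assumed here)
    
    strings = ["this is a sentence", "this is also a sentence",
               "Mary loves elephants", "Mary loves an elephant",
               "Mark loves elephants"]
    
    # Hard-coded subsets standing in for the output of the fast dividing step.
    subsets = [strings[:2], strings[2:]]
    
    THRESHOLD = 6  # assumed edit-distance cutoff for "similar"
    
    # Only C^2_2 + C^2_3 = 1 + 3 = 4 comparisons instead of C^2_5 = 10.
    for subset in subsets:
        for a, b in combinations(subset, 2):
            d = L.distance(a, b)
            if d <= THRESHOLD:
                print(f"distance {d}: {a!r} ~ {b!r}")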

The dividing step needs to be fast, so it does not need to be accurate. Subsets can overlap. It is also acceptable if a string's similar partner occasionally ends up in a different subset -- in that case I'll scan a nearby subset as well.

I was trying to find an order-preserving hash to roughly map strings to integers (collisions don't matter), so that each string only needs to be checked against candidate strings with nearby integers, but I failed to find such an algorithm.

I'm using Python, but I'm willing to try a solution that is only available in another programming language.

Thank you very much.


Solution

  • You could use Levenshtein distance as a key function when sorting.

    import re
    import requests
    import Levenshtein as L
    
    def download_moby_dick():
        moby_dick_url = 'https://www.gutenberg.org/files/2701/2701-0.txt'
        return requests.get(moby_dick_url).text
    
    def sentences_in_book(book):
        # split on sentence-ending punctuation or on blank lines
        sentences = (s for s in re.split(r'[.;?!]\s|\r\n\r\n', book))
        # normalise whitespace and drop very short fragments
        sentences = (re.sub(r'\s+', ' ', s).strip() for s in sentences)
        sentences = (s for s in sentences if len(s) > 10)
        return list(sentences)
    
    sentences = sentences_in_book(download_moby_dick())
    
    # sort by length
    sentences.sort(key=len)
    
    # take the median-length sentence as the reference point
    target = sentences[len(sentences)//2]
    
    # sort by Levenshtein distance to the target
    def keyfunc(sentence):
        return L.distance(target, sentence)
    
    sentences.sort(key=keyfunc)
    

    This will give a rough order in which similar sentences are grouped together. To speed it up, you might have to split up the task further, for example by abbreviating the input sentences (keeping only some letters from each word), or by only comparing sentences that have roughly the same length, etc.
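    As a rough illustration of using that order, here is a sketch with an assumed look-ahead window and an assumed distance cutoff (both would need tuning): each sentence is compared only against its next few neighbours in the distance-sorted list, and pairs whose lengths already differ too much are skipped.

    # Hedged sketch: positions that are close in the distance-sorted list are
    # candidate similar pairs, so only look a fixed number of positions ahead.
    WINDOW = 50    # assumed look-ahead window; larger = better recall, slower
    MAX_DIST = 10  # assumed edit-distance cutoff for "similar"
    
    similar_pairs = []
    for i, a in enumerate(sentences):
        for b in sentences[i + 1 : i + 1 + WINDOW]:
            # the length difference is a lower bound on the edit distance
            if abs(len(a) - len(b)) > MAX_DIST:
                continue
            if L.distance(a, b) <= MAX_DIST:
                similar_pairs.append((a, b))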