python, string, nlp, distance, similarity

How to find a similar substring inside a large string with a similarity score in Python?


What I'm looking for is not just a plain similarity score between two texts, but a similarity score for a substring inside a larger string. Say:

text1 = 'cat is sleeping on the mat'

text2 = 'The cat is sleeping on the red mat in the living room'

In the above example, every word of text1 also appears in text2, so the similarity should be 100%.

If some words of text1 are missing from text2, the score should be lower.

I'm working with a large dataset of paragraphs of varying size, so finding a smaller paragraph inside a bigger one with such a similarity score is crucial.

I have only found string similarity measures such as cosine similarity and difflib similarity, which compare two whole strings. I found nothing that scores a substring inside another string.


Solution

  • Based on your description, how about:

    >>> a = "cat is sleeping on the mat"
    >>> b = "the cat is sleeping on the red mat in the living room"
    >>> a = a.split(" ")
    >>> score = 0.0
    >>> for word in a:  # for every word in your string
    ...     if word in b:  # if it is in your bigger string, increase the score
    ...         score += 1
    ...
    >>> score / len(a)  # fraction of words found, out of the total word count
    1.0
    

    If a word is missing from the bigger string, for example:

    >>> c = "the cat is not sleeping on the mat"
    >>> c = c.split(" ")
    >>> score = 0.0
    >>> for w in c:  # "not" does not occur in b, so 7 of 8 words match
    ...     if w in b:
    ...         score += 1
    ...
    >>> score / len(c)
    0.875
    

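    Additionally, you can do as @roadrunner suggests and split b into a set with b = set(b.split(" ")) to speed things up. This reduces each membership test to O(1) on average and makes the overall algorithm linear in the total number of words. It also means only whole words are matched, because word in b on a plain string is a substring test (so "at" would match inside "cat").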
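    As a rough sketch of that set-based variant, the loop can be wrapped in a function. The name word_overlap_score is made up for illustration, and lowercasing both strings is an assumption added here so that "The" and "the" count as the same word:

```python
def word_overlap_score(small: str, big: str) -> float:
    """Fraction of words in `small` that occur as whole words in `big`."""
    # Assumption: matching is case-insensitive and words are whitespace-separated.
    small_words = small.lower().split()
    if not small_words:
        return 0.0
    big_words = set(big.lower().split())  # set gives O(1) average lookups
    hits = sum(1 for word in small_words if word in big_words)
    return hits / len(small_words)


text1 = "cat is sleeping on the mat"
text2 = "The cat is sleeping on the red mat in the living room"
print(word_overlap_score(text1, text2))  # 1.0
print(word_overlap_score("the cat is not sleeping on the mat", text2))  # 0.875
```

    Punctuation is not stripped in this sketch, so "mat." and "mat" would be treated as different words.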

    Edit: You say you already tried some metrics such as cosine similarity. However, I suspect you may also benefit from looking at Levenshtein distance, which could be useful here in addition to the solutions provided.
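    One possible way to combine Levenshtein distance with the word-based score, so that a misspelled word still earns partial credit, is sketched below. This is only one interpretation of the suggestion: the function names levenshtein and fuzzy_word_score are hypothetical, the distance is a plain dynamic-programming implementation rather than a library call, and the normalization by the longer word's length is an assumption:

```python
def levenshtein(s: str, t: str) -> int:
    """Edit distance between s and t (insertions, deletions, substitutions)."""
    previous = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete cs
                current[j - 1] + 1,      # insert ct
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def fuzzy_word_score(small: str, big: str) -> float:
    """Average, over the words of `small`, of the best similarity to any word of `big`."""
    # Assumption: case-insensitive, whitespace-separated words.
    small_words = small.lower().split()
    big_words = set(big.lower().split())
    if not small_words or not big_words:
        return 0.0
    total = 0.0
    for word in small_words:
        # Similarity in [0, 1]: 1 means identical, 0 means nothing in common.
        total += max(
            1 - levenshtein(word, other) / max(len(word), len(other))
            for other in big_words
        )
    return total / len(small_words)


text2 = "The cat is sleeping on the red mat in the living room"
print(fuzzy_word_score("cat is sleeping on the mat", text2))  # 1.0
print(fuzzy_word_score("cat is sleping on the mat", text2))   # about 0.98
```

    Every word of the smaller string is compared against every distinct word of the bigger one, so this is noticeably slower than the set lookup and may need optimizing for a large dataset.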