What I'm looking for is not just a plain similarity score between two texts, but a similarity score of a substring inside a string. Say:
text1 = 'cat is sleeping on the mat'
text2 = 'The cat is sleeping on the red mat in the living room'
In the above example, all the words of text1 are present in text2, hence the similarity should be 100%. If some words of text1 are missing, the score should be lower.
I'm working with a large dataset of paragraphs of varying sizes, so finding a smaller paragraph inside a bigger one with such a similarity score is crucial.
I have only found string similarity measures such as cosine similarity, difflib similarity, etc., which compare two whole strings, but nothing that scores a substring inside another string.
Based on your description, how about:
>>> a = "cat is sleeping on the mat"
>>> b = "the cat is sleeping on the red mat in the living room"
>>> a = a.split(" ")
>>> score = 0.0
>>> for word in a: #for every word in your string
if word in b: #if it is in your bigger string increase score
score += 1
>>> score/len(a) #obtain percentage given total word number
1.0
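One thing worth noting: word in b here is a plain substring test on the raw string b, so a short token can also match inside a longer word:

>>> "he" in b  # True because "he" occurs inside "the", although it is not a word of b
True

Splitting b into a set of words, as shown further below, avoids this.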
In case a word were missing, for example:
>>> c = "the cat is not sleeping on the mat"
>>> c = c.split(" ")
>>> score = 0.0
>>> for w in c:
if w in b:
score +=1
>>> score/len(c)
0.875
Additionally, you can do as @roadrunner suggests and split b into a set of words with b = set(b.split(" ")) to speed things up. This reduces each membership test to O(1) and improves the overall algorithm to O(n) complexity, and as a bonus it matches whole words rather than substrings.
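For instance, a minimal sketch of that variant, reusing the strings from above:

>>> b = set(b.split(" "))  # one-time split; each lookup is then O(1) on average
>>> score = sum(1.0 for word in a if word in b)  # whole-word membership, no substring matches
>>> score / len(a)
1.0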
Edit: You say you already tried some metrics like cosine similarity, etc. However, I suspect you may benefit from checking the Levenshtein distance similarity, which could be of some use in this case in addition to the solutions provided.
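For completeness, here is a rough sketch of how a Levenshtein distance could plug into the word-containment idea, so that near-misses such as typos still count. The levenshtein helper and the max_dist=1 threshold are my own illustrative choices, not something from the original question:

def levenshtein(s, t):
    # classic two-row dynamic-programming edit distance
    if len(s) < len(t):
        s, t = t, s
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        current = [i]
        for j, ct in enumerate(t, 1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (cs != ct)))    # substitution
        previous = current
    return previous[-1]

def fuzzy_containment(small, big, max_dist=1):
    # count a word of the smaller text as "present" if some word of the
    # bigger text is within max_dist edits of it (tolerates small typos;
    # for very short words a threshold of 0 may be more sensible)
    small_words = small.split(" ")
    big_words = big.split(" ")
    hits = sum(1 for w in small_words
               if any(levenshtein(w, bw) <= max_dist for bw in big_words))
    return hits / len(small_words)

For example, fuzzy_containment("cat is sleping on the mat", "the cat is sleeping on the red mat in the living room") returns 1.0 even though "sleping" is misspelled.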