python algorithm string-matching sequencematcher

Find match percentage between two strings while also taking into consideration the order of the words - Python


I am looking for a way to compute the match percentage between two strings (e.g. names) while also taking into account that they might contain the same words in a different order. I tried using SequenceMatcher(), but the results are only partially satisfying:

from difflib import SequenceMatcher

a = "john doe"
b = "jon doe"
c = "doe john"
d = "jon d"
e = 'john do'

s = SequenceMatcher(None, a, b)
s.ratio()
0.9333333333333333

s = SequenceMatcher(None, a, c)
s.ratio()
0.5

s = SequenceMatcher(None, a, d)
s.ratio()
0.7692307692307693

s = SequenceMatcher(None, a, e)
s.ratio()
0.9333333333333333

I am OK with all but the second result. It does not take into account that c contains the same words as a, just in a different order.

Is there another way to match strings that gives a higher match percentage in the case above? Note that names may contain more than two words.

Thank you!


Solution

  • That depends on what you expect from the enhanced matching. If you think the second comparison should score 1.0, then it's simple: split each string into words, sort the words, rejoin them, then apply SM (SequenceMatcher). If you want a penalty for the reordering, you could use any edit-distance-style measure to quantify how far apart the two original word lists are, and use that as a factor on the final match score.

    Does that help move you along?
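
The two ideas above could be sketched roughly as follows. This is only one possible reading of the suggestion, not a definitive implementation: the function names `sorted_word_ratio` and `penalized_ratio` and the `weight` parameter are made up for illustration, and using a word-level SequenceMatcher as the order measure is an assumption (any other distance between the word lists would slot in the same way).

```python
from difflib import SequenceMatcher


def sorted_word_ratio(x, y):
    """Character-level ratio after sorting the words of each string."""
    # Sorting removes word order, so "john doe" and "doe john" become identical.
    norm_x = " ".join(sorted(x.split()))
    norm_y = " ".join(sorted(y.split()))
    return SequenceMatcher(None, norm_x, norm_y).ratio()


def penalized_ratio(x, y, weight=0.2):
    """Sorted-word ratio, scaled down when the original word order differs.

    `weight` is a hypothetical tuning knob: 0 ignores word order entirely,
    larger values penalize reordering more strongly.
    """
    # Assumption: compare the unsorted word lists with SequenceMatcher
    # (it accepts any sequences of hashable items, not just strings).
    order = SequenceMatcher(None, x.split(), y.split()).ratio()
    return sorted_word_ratio(x, y) * (1 - weight * (1 - order))


a = "john doe"
b = "jon doe"
c = "doe john"

print(sorted_word_ratio(a, c))  # same words, different order
print(sorted_word_ratio(a, b))  # misspelling still scores below 1.0
print(penalized_ratio(a, c))    # reordered words, slightly penalized
```

With this sketch, `sorted_word_ratio(a, c)` comes out as 1.0 because both strings normalize to the same text, while `penalized_ratio(a, c)` stays a little below 1.0. Names with more than two words are handled the same way, since the split and sort do not depend on the word count.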