I am looking for a way to output the match percentage between two strings (e.g. names) while also taking into consideration that they might be the same but with the words in a different order. I tried using SequenceMatcher(), but the results are only partially satisfying:
from difflib import SequenceMatcher
a = "john doe"
b = "jon doe"
c = "doe john"
d = "jon d"
e = 'john do'
s = SequenceMatcher(None, a, b)
s.ratio()
0.9333333333333333
s = SequenceMatcher(None, a, c)
s.ratio()
0.5
s = SequenceMatcher(None, a, d)
s.ratio()
0.7692307692307693
s = SequenceMatcher(None, a, e)
s.ratio()
0.9333333333333333
I am OK with all but the second result. It does not take into consideration that c contains the same words as a, just in a different order.
Is there any other way to match strings and obtain a higher matching percentage in the case I mentioned above? Note that names may contain more than two words.
Thank you!
That depends on what you expect from the enhanced matching. If you think the second comparison should score 1.0, then it's simple: split each string into words, sort the words, rejoin them, and then apply SequenceMatcher (SM). If you want a match penalty for the reordering, you could use any of the transformation functions to measure the distance between the two lists of words, and use that as a factor on the eventual match.
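Here is a minimal sketch of the split-sort-compare idea, in case it helps. The function names `sorted_word_ratio` and `blended_ratio` and the `order_weight` parameter are my own, made up for illustration. The second function is only one possible way to penalize reordering: it blends the order-insensitive score with the plain SequenceMatcher score rather than computing a distance between the word lists, so treat it as an assumption, not the definitive approach.

```python
from difflib import SequenceMatcher


def sorted_word_ratio(x, y):
    """Similarity of two strings with word order ignored.

    Hypothetical helper: lowercases, splits on whitespace, sorts the
    words, rejoins them with single spaces, then compares the results.
    """
    x_norm = " ".join(sorted(x.lower().split()))
    y_norm = " ".join(sorted(y.lower().split()))
    return SequenceMatcher(None, x_norm, y_norm).ratio()


def blended_ratio(x, y, order_weight=0.2):
    """Order-insensitive score with a mild penalty for reordering.

    Assumption: the penalty is a weighted blend with the plain,
    order-sensitive ratio. order_weight=0 ignores word order entirely;
    order_weight=1 gives the ordinary SequenceMatcher ratio.
    """
    raw = SequenceMatcher(None, x.lower(), y.lower()).ratio()
    return (1 - order_weight) * sorted_word_ratio(x, y) + order_weight * raw


if __name__ == "__main__":
    a = "john doe"
    for other in ("jon doe", "doe john", "jon d", "john do"):
        print(other, sorted_word_ratio(a, other), blended_ratio(a, other))
```

With this, "john doe" against "doe john" scores 1.0 under `sorted_word_ratio`, and somewhat less under `blended_ratio` depending on the weight you pick. Because the words are split and sorted as a list, names with more than two words are handled the same way.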
Does that help move you along?