I am writing a function to compare two strings (the use case is comparing a bank statement entry with the original string created at the moment of invoicing). I want to know what fraction (percentage) of the smaller string, compareString,
is contained in the original string. Only runs of at least 4 consecutive characters count as a match, and the order of the matches does not matter.
def relStringMatch(originalString, compareString):
    # Count how many characters of compareString occur in originalString
    # as runs of at least smallestMatch consecutive characters.
    smallestMatch = 4
    originalString = originalString.upper()
    compareString = compareString.upper()
    stringLength = len(compareString)
    lastTest = stringLength - smallestMatch
    index = 0
    totalMatch = 0
    while index < lastTest:
        nbChars = smallestMatch
        found = False
        # Grow the candidate substring until it no longer occurs in originalString.
        while (index + nbChars) <= stringLength:
            checkString = compareString[index:index + nbChars]
            if originalString.find(checkString) < 0:
                if nbChars == smallestMatch:
                    nbChars = 0
                nbChars -= 1
                break
            else:
                found = True
            nbChars += 1
        if found:
            totalMatch += nbChars
            index += nbChars
        else:
            index += 1
    return totalMatch / stringLength
The code works well. For example,
relStringMatch("9999EidgFinanzverwaltungsteuer", "EIDG. FINANZVERWALTUNG")
returns roughly 0.95, which is correct.
Now the question: is there a more elegant way to do the same task? If I read this code again in a few years, I will probably not understand it any more...
Rather than reinventing the wheel, you can use one of a number of well-defined metrics for comparing strings and evaluating their similarity, e.g. the Levenshtein distance:
https://en.wikipedia.org/wiki/Levenshtein_distance
Python libraries implementing it already exist, for example python-Levenshtein:
https://pypi.org/project/python-Levenshtein/
from Levenshtein import ratio
ratio('Hello world!', 'Holly grail!')
# 0.583333...
ratio('Brian', 'Jesus')
# 0.0
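If what you want is specifically the fraction of compareString covered by matching runs of at least 4 characters, the standard library's difflib can express a very similar idea in a few lines. The following is only a minimal sketch, not a drop-in replacement: the name rel_string_match and the smallest_match parameter are illustrative, and SequenceMatcher only counts blocks that match in order, so the result can differ slightly from your hand-rolled version.

from difflib import SequenceMatcher

def rel_string_match(original_string, compare_string, smallest_match=4):
    # Fraction of compare_string covered by matching blocks of at least
    # smallest_match consecutive characters (case-insensitive).
    matcher = SequenceMatcher(None, original_string.upper(), compare_string.upper())
    matched = sum(block.size for block in matcher.get_matching_blocks()
                  if block.size >= smallest_match)
    return matched / len(compare_string)

rel_string_match("9999EidgFinanzverwaltungsteuer", "EIDG. FINANZVERWALTUNG")
# roughly 0.91 here: "EIDG" (4) + "FINANZVERWALTUNG" (16) = 20 of 22 characters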