Tags: python, python-3.x, pattern-matching, string-comparison, fractions

Find relative match of two strings


I am writing a function to compare two strings (the use case is matching a bank statement entry against the original string created at invoicing time). I want to know what fraction of the shorter string, compareString, is contained in the original string. Only runs of at least 4 consecutive characters count, and the order in which the matches occur does not matter.

def relStringMatch(originalString, compareString):

    smallestMatch = 4

    # Compare case-insensitively
    originalString = originalString.upper()
    compareString = compareString.upper()

    stringLength = len(compareString)
    if stringLength == 0:
        return 0.0
    lastTest = stringLength - smallestMatch

    index = 0
    totalMatch = 0
    # Try every start position that still leaves room for a minimal match
    while index <= lastTest:
        nbChars = smallestMatch
        found = False
        # Grow the window while it still occurs somewhere in originalString
        while (index + nbChars) <= stringLength:
            checkString = compareString[index:index + nbChars]
            if originalString.find(checkString) < 0:
                break
            found = True
            nbChars += 1
        if found:
            nbChars -= 1  # the last window tried failed or ran past the end
            totalMatch += nbChars
            index += nbChars  # skip past the matched block
        else:
            index += 1
    return totalMatch / stringLength

The code runs well. As an example:

relStringMatch("9999EidgFinanzverwaltungsteuer", "EIDG. FINANZVERWALTUNG")

returns about 0.91, which is correct: EIDG (4 characters) and FINANZVERWALTUNG (16 characters) cover 20 of the 22 characters.

Now the question: is there a more elegant way to do the same task? If I read this code again in a few years, I will probably not understand it any more...


Solution

  • Rather than reinventing the wheel, you can use one of several well-defined metrics for comparing strings and evaluating their similarity, e.g. the Levenshtein distance:

    https://en.wikipedia.org/wiki/Levenshtein_distance

    Python libraries that implement it already exist:

    https://pypi.org/project/python-Levenshtein/

    from Levenshtein import ratio
    ratio('Hello world!', 'Holly grail!')
    # 0.583333...
    
    ratio('Brian', 'Jesus')
    # 0.0
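
A whole-string similarity score is not quite the same thing as the fraction of the shorter string covered by matching runs. If that coverage figure is what you need, the standard library's difflib may be enough. The following is only a minimal sketch: coverage_ratio and its min_block parameter are hypothetical names chosen for illustration, and unlike the function in the question, SequenceMatcher only reports blocks that appear in the same order in both strings.

```python
from difflib import SequenceMatcher


def coverage_ratio(original, compare, min_block=4):
    """Fraction of `compare` covered by blocks shared with `original`.

    Hypothetical helper: the name and the min_block default are
    illustrative; 4 mirrors smallestMatch in the question.
    """
    original, compare = original.upper(), compare.upper()
    if not compare:
        return 0.0
    # autojunk=False turns off difflib's heuristic that ignores very
    # frequent characters in long strings.
    matcher = SequenceMatcher(None, original, compare, autojunk=False)
    # Each block is (a, b, size); the final sentinel block has size 0.
    matched = sum(
        block.size
        for block in matcher.get_matching_blocks()
        if block.size >= min_block
    )
    return matched / len(compare)


print(coverage_ratio("9999EidgFinanzverwaltungsteuer", "EIDG. FINANZVERWALTUNG"))
```

For the example from the question, the matching blocks are EIDG and FINANZVERWALTUNG, so 20 of the 22 characters are counted. If the order of the fragments can differ between the two strings, this sketch will undercount and an explicit search like the one in the question is still needed.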