Search code examples
c#ocrlevenshtein-distancefuzzy-search

Fuzzy matching multiple words in string


I'm trying to employ the help of the Levenshtein Distance to find fuzzy keywords(static text) on an OCR page.
To do this, I want to give a percentage of errors that are allowed (say, 15%).

string Keyword = "past due electric service";

Since the keyword is 25 characters long, I want to allow for 4 errors (25 * .15 rounded up)
I need to be able to compare it to...

string Entire_OCR_Page = "previous bill amount payment received on 12/26/13 thank 
                          you! current electric service total balances unpaid 7 
                          days after the total due date are subject to a late 
                          charge of 7.5% of the amount due or $2.00, whichever/5 
                          greater. "

This is how I am doing it now...

int LevenshteinDistance = LevenshteinAlgorithm(Keyword, Entire_OCR_Page); // = 202   
int NumberOfErrorsAllowed = 4;   
int Allowance = (Entire_OCR_Page.Length() - Keyword.Length()) + NumberOfErrorsAllowed; // = 205

Clearly, Keyword is not found in OCR_Text (which it shouldn't be). But, using Levenshtein's Distance, the number of errors is less than the 15% leeway (therefore my logic says it's found).

Does anyone know of a better way to do this?


Solution

  • Answered My Question with the use of sub-strings. Posting in case others run into the same type of problem. A little unorthodox, but it works great for me.

    int TextLengthBuffer = (int)StaticTextLength - 1; //start looking for correct result with one less character than it should have.
    int LowestLevenshteinNumber = 999999; //initialize insanely high maximum
    decimal PossibleStringLength = (PossibleString.Length); //Length of string to search
    decimal StaticTextLength = (StaticText.Length); //Length of text to search for
    decimal NumberOfErrorsAllowed = Math.Round((StaticTextLength * (ErrorAllowance / 100)), MidpointRounding.AwayFromZero); //Find number of errors allowed with given ErrorAllowance percentage
    
        //Look for best match with 1 less character than it should have, then the correct amount of characters.
        //And last, with 1 more character. (This is because one letter can be recognized as 
        //two (W -> VV) and visa versa) 
    
    for (int i = 0; i < 3; i++) 
    {
        for (int e = TextLengthBuffer; e <= (int)PossibleStringLength; e++)
        {
            string possibleResult = (PossibleString.Substring((e - TextLengthBuffer), TextLengthBuffer));
            int lAllowance = (int)(Math.Round((possibleResult.Length - StaticTextLength) + (NumberOfErrorsAllowed), MidpointRounding.AwayFromZero));
            int lNumber = LevenshteinAlgorithm(StaticText, possibleResult);
    
            if (lNumber <= lAllowance && ((lNumber < LowestLevenshteinNumber) || (TextLengthBuffer == StaticText.Length && lNumber <= LowestLevenshteinNumber)))
            {
                PossibleResult = (new StaticTextResult { text = possibleResult, errors = lNumber });
                LowestLevenshteinNumber = lNumber;
            }
        }
        TextLengthBuffer++;
    }
    
    
    
    
    public static int LevenshteinAlgorithm(string s, string t) // Levenshtein Algorithm
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];
    
        if (n == 0)
        {
            return m;
        }
    
        if (m == 0)
        {
            return n;
        }
    
        for (int i = 0; i <= n; d[i, 0] = i++)
        {
        }
    
        for (int j = 0; j <= m; d[0, j] = j++)
        {
        }
    
        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
    
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[n, m];
    }