
Find best substring match


I'm looking for a library, or a method using existing libraries (difflib, fuzzywuzzy, python-levenshtein), to find the closest match of a string (query) in a text (corpus).

I've developed a method based on difflib, where I split my corpus into character n-grams of size n (the length of the query).

import difflib
from nltk.util import ngrams

def get_best_match(query, corpus):
    ngs = ngrams( list(corpus), len(query) )
    ngrams_text = [''.join(x) for x in ngs]
    return difflib.get_close_matches(query, ngrams_text, n=1, cutoff=0)

It works as I want when the differences between the query and the matched string are just character replacements.

query = "ipsum dolor"
corpus = "lorem 1psum d0l0r sit amet"

match = get_best_match(query, corpus)
# match = "1psum d0l0r"

But when characters have been deleted, it does not.

query = "ipsum dolor"
corpus = "lorem 1psum dlr sit amet"

match = get_best_match(query, corpus)
# match = "psum dlr si"
# expected_match = "1psum dlr"

Is there a way to get a more flexible result size (as for expected_match)?

EDIT 1:

  • The actual use of this script is to match queries (strings) against messy OCR output.
  • As I said in the question, the OCR can confuse characters, and even miss them.
  • If possible, also consider the case where a space is missing between words.
  • A best match is one that does not include characters from words other than those in the query.

EDIT 2:

The solution I use now is to extend the n-grams with (n-k)-grams for k = {1,2,3}, to handle up to 3 deletions. It's much better than the first version, but it is slower, since there are more than 3 times as many n-grams to check. It also does not generalize.
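
For reference, the (n-k)-gram extension could look roughly like the sketch below. It assumes plain string slicing in place of nltk's ngrams, and the names char_ngrams, get_best_match_multi and max_deletions are made up for illustration.

```python
import difflib


def char_ngrams(text, n):
    """Return all contiguous substrings of text that have length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def get_best_match_multi(query, corpus, max_deletions=3):
    """Search substrings of length len(query) - k for k = 0..max_deletions.

    Hypothetical helper, not part of difflib or nltk.
    """
    candidates = []
    for k in range(max_deletions + 1):
        n = len(query) - k
        if n < 1:
            break  # the query is too short to drop more characters
        candidates.extend(char_ngrams(corpus, n))
    # cutoff=0 keeps the single best candidate regardless of its score
    return difflib.get_close_matches(query, candidates, n=1, cutoff=0)


match = get_best_match_multi("ipsum dolor", "lorem 1psum dlr sit amet")
print(match)
```

The candidate list grows by roughly one full set of substrings per allowed deletion, which is where the slowdown comes from.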


Solution

  • This function finds the best-matching substring of variable length.

    The implementation treats the corpus as one long string, which avoids your concerns about spaces and unseparated words.

    Code summary:

    1. Scan the corpus for match values in steps of size step to find the approximate location of the highest match value, pos.
    2. Find the substring in the vicinity of pos with the highest match value by adjusting the left/right positions of the substring.

    from difflib import SequenceMatcher
    
    def get_best_match(query, corpus, step=4, flex=3, case_sensitive=False, verbose=False):
        """Return best matching substring of corpus.
    
        Parameters
        ----------
        query : str
        corpus : str
        step : int
            Step size of first match-value scan through corpus. Can be thought of
            as a sort of "scan resolution". Should not exceed length of query.
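        flex : int
            Max. left/right substring position adjustment value. Should not
            exceed length of query / 2.
        case_sensitive : bool
            If False (default), query and corpus are lower-cased before matching.
        verbose : bool
            If True, print intermediate match values.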
    
        Outputs
        -------
        output0 : str
            Best matching substring.
        output1 : float
            Match ratio of best matching substring. 1 is perfect match.
        """
    
        def _match(a, b):
            """Compact alias for SequenceMatcher."""
            return SequenceMatcher(None, a, b).ratio()
    
        def scan_corpus(step):
            """Return list of match values from corpus-wide scan."""
            match_values = []
    
            m = 0
            while m + qlen - step <= len(corpus):
                # Compare the query against a window of the same length as the query.
                window = corpus[m: m + qlen]
                value = _match(query, window)
                match_values.append(value)
                if verbose:
                    print(query, "-", window, value)
                m += step
    
            return match_values
    
        def index_max(v):
            """Return index of max value."""
            return max(range(len(v)), key=v.__getitem__)
    
        def adjust_left_right_positions():
            """Return left/right positions for best string match."""
            # bp_l / bp_r are the best left/right positions found so far;
            # bmv_l / bmv_r are the best match values reached by moving each side.
            p_l, bp_l = [pos] * 2
            p_r, bp_r = [pos + qlen] * 2
    
            # bmv_* are declared here in case they are untouched in optimization
            bmv_l = match_values[p_l // step]
            bmv_r = match_values[p_l // step]
    
            for f in range(flex):
                # Clamp at 0 so a negative start does not wrap around to the end of corpus.
                ll = _match(query, corpus[max(p_l - f, 0): p_r])
                if ll > bmv_l:
                    bmv_l = ll
                    bp_l = max(p_l - f, 0)
    
                lr = _match(query, corpus[p_l + f: p_r])
                if lr > bmv_l:
                    bmv_l = lr
                    bp_l = p_l + f
    
                rl = _match(query, corpus[p_l: p_r - f])
                if rl > bmv_r:
                    bmv_r = rl
                    bp_r = p_r - f
    
                rr = _match(query, corpus[p_l: p_r + f])
                if rr > bmv_r:
                    bmv_r = rr
                    bp_r = p_r + f
    
                if verbose:
                    print("\n" + str(f))
                    print("ll: -- value: %f -- snippet: %s" % (ll, corpus[p_l - f: p_r]))
                    print("lr: -- value: %f -- snippet: %s" % (lr, corpus[p_l + f: p_r]))
                    print("rl: -- value: %f -- snippet: %s" % (rl, corpus[p_l: p_r - f]))
                    print("rr: -- value: %f -- snippet: %s" % (rr, corpus[p_l: p_r + f]))
    
            return bp_l, bp_r, _match(query, corpus[bp_l : bp_r])
    
        if not case_sensitive:
            query = query.lower()
            corpus = corpus.lower()
    
        qlen = len(query)
    
        if flex >= qlen/2:
            print("Warning: flex exceeds length of query / 2. Setting to default.")
            flex = 3
    
        match_values = scan_corpus(step)
        pos = index_max(match_values) * step
    
        pos_left, pos_right, match_value = adjust_left_right_positions()
    
        return corpus[pos_left: pos_right].strip(), match_value
    

    Example:

    query = "ipsum dolor"
    corpus = "lorem i psum d0l0r sit amet"
    match = get_best_match(query, corpus, step=2, flex=4)
    print(match)
    ('i psum d0l0r', 0.782608695652174)
    

    A good heuristic is to keep step < len(query) * 3/4 and flex < len(query) / 3. I also added a case_sensitive option, in case that matters. It works quite well once you start tuning the step and flex values. Smaller step values give better results but take longer to compute. flex governs how much the length of the resulting substring is allowed to vary.

    Important to note: this only finds the first best match, so if several matches are equally good, only the first one is returned. To allow for multiple matches, change index_max() to return a list of indices for the n highest values of the input list, and loop over adjust_left_right_positions() for the values in that list.
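
A minimal sketch of that change, assuming a helper called top_n_indices (a hypothetical name) that takes the number of candidates n. It only covers the index selection; adjust_left_right_positions() would still have to be run once per candidate position. When step is smaller than the query length, neighbouring scan windows overlap, so several of the top indices may point to the same region of the corpus.

```python
import heapq


def top_n_indices(values, n=3):
    """Return the indices of the n largest values, best first."""
    return heapq.nlargest(n, range(len(values)), key=values.__getitem__)


# Illustrative scan results, not real output of scan_corpus()
match_values = [0.10, 0.90, 0.50, 0.70]
step = 2  # the same step that was used for the scan

# Corpus start positions of the two best scan windows
candidate_positions = [i * step for i in top_n_indices(match_values, n=2)]
print(candidate_positions)  # [2, 6]
```

Each entry of candidate_positions would take the place of pos in the original function.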