Search code examples
pythonregexpython-regex

How does BestMatch findall decide how many results to return?


I'm struggling to predict how many fuzzy matches findall() will return when using regex in Python with BESTMATCH enabled:

>>> regex.findall(r'(?b)(North\ West){i<=0,s<=2,d<=1}', "South west South West North West", regex.V1)
['North West']

Does not match South West at all

>>> regex.findall(r'(?b)(North\ West){i<=0,s<=2,d<=1}', "South west South West North West North west South West", regex.V1)
['North West', 'North west', 'South West']

Matches South West

I'm not clear if this is a bug or as intended?


Solution

  • I think I have a partial explanation:

    The behaviour appears to be:

    • Return all perfect matches (with no substitutions, insertions or deletions)
    • Return all imperfect matches AFTER the last perfect match

    It appears (through testing), that BESTMATCH behaves like a normal search, except if it finds a perfect match, it throws away any preceding imperfect matches. Hence the behaviour that it returns a series of perfect matches, followed by zero or more imperfect matches.

    Some examples:

    >>> regex.findall(r'(?b)(abc){i<=0,s<=2,d<=1}', "abc abd abd aaa abc", regex.V1)
    ['abc', 'abc']
    >>> regex.findall(r'(?b)(abc){i<=0,s<=2,d<=1}', "abc abd abd aaa abc abb", regex.V1)
    ['abc', 'abc', 'abb']
    >>> regex.findall(r'(?b)(ab[cd]){i<=0,s<=2,d<=1}', "abc abd abd aaa abc abb", regex.V1)
    ['abc', 'abd', 'abd', 'abc', 'abb']
    >>> regex.findall(r'(?b)(ab[cd]){i<=0,s<=2,d<=1}', "abc abd abd aaa abc abb abc", regex.V1)
    ['abc', 'abd', 'abd', 'abc', 'abc']