Search code examples
pythonregexfuzzy-searchpypi-regex

Fuzzy regex (e.g. {e<=2}) correct usage in Python


I am trying to find strings which are at most two mistakes 'away' from the original pattern string (i.e. they differ by at most two letters).

However, the following code isn't working as I would expect, at least not from my understanding of fuzzy regex:

import regex
res = regex.findall("(ATAGGAGAAGATGATGTATA){e<=2}", "ATAGAGCAAGATGATGTATA", overlapped=True)
print res
>> ['ATAGAGCAAGATGATGTATA']  # the second string

As you can see, the two strings differ on three letters rather than at most two:

the first has: ATAGGAGAAGATGATGTATA

the second has: ATAGAGCAAGATGATGTATA

and yet the result shows the second string, as though it's within e<=2 (this also happens with overlapped=False, so that can't be it).

What am I missing here? And is there any way of getting this to find only strings within the Hamming 2-ball of the given pattern?

Is it possible that a swap of letters is considered to be only one change? And if so - how can I avoid this?


Solution

  • Let's check this snippet for fuzzy counts:

    >>> pattern_string = 'ATAGGAGAAGATGATGTATA'
    >>> query_string = 'ATAGAGCAAGATGATGTATA'
    >>> r = regex.compile('(%s){e<=2}' % pattern_string)
    >>> r.match(query_string)
    <regex.Match object; span=(0, 20), match='ATAGAGCAAGATGATGTATA', fuzzy_counts=(0, 1, 1)>
    

    fuzzy_counts=(0, 1, 1) means that in this case, we get no substitutions, 1 insertion, and 1 deletion. So your filter works because the total count of errors is 2.

    But it seems that you need to filter only by substitutions count, so you can modify the regex:

    import regex
    res = regex.findall("(ATAGGAGAAGATGATGTATA){s<=2}", "ATAGAGCAAGATGATGTATA", overlapped=True)
    print res
    

    Check this great example from docs:

    • {i<=3} permit at most 3 insertions, but no other types
    • {d<=3} permit at most 3 deletions, but no other types
    • {s<=3} permit at most 3 substitutions, but no other types
    • {i<=1,s<=2} permit at most 1 insertion and at most 2 substitutions, but no deletions
    • {e<=3} permit at most 3 errors
    • {1<=e<=3} permit at least 1 and at most 3 errors

    • {i<=2,d<=2,e<=3} permit at most 2 insertions, at most 2 deletions, at most 3 errors in total, but no substitutions