Approximate regex in Python with TRE: strange Unicode behavior


I am trying to use the TRE library in Python to match misspelled input.
It is important that it handles UTF-8 encoded strings well.

An example:
The German capital's name is Berlin, but because the pronunciation is the same, people might write "Bärlin".

This works so far, but if a non-ASCII character is in the first or second position of the detected string, neither the reported range nor the detected string itself is correct.

# -*- coding: utf-8 -*-
import tre

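def apro_match(word, lines):
    fz = tre.Fuzzyness(maxerr=3)  # allow up to three errors
    pt = tre.compile(word)
    for line in lines:  # iterate over the argument, not the global list
        m = pt.search(line, fz)
        if m:
            # m.groups()[0] is the (start, end) range, m[0] the matched text
            print m.groups()[0], ' ', m[0]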

if __name__ == '__main__':
    string1 = u'Berlín'.encode('utf-8')
    string2 = u'Bärlin'.encode('utf-8')    
    string3 = u'B\xe4rlin'.encode('utf-8')
    string4 = u'Berlän'.encode('utf-8')
    string5 = u'London, Paris, Bärlin'.encode('utf-8')
    string6 = u'äerlin'.encode('utf-8')
    string7 = u'Beälin'.encode('utf-8')

    l = ['Moskau', string1, string2, string3, string4, string5, string6, string7]

    print '\n'*2
    print "apro_match('Berlin', l)"
    print "="*20
    apro_match('Berlin', l)
    print '\n'*2

    print "apro_match('.*Berlin', l)"
    print "="*20
    apro_match('.*Berlin', l)

Output:

apro_match('Berlin', l)
====================
(0, 7)   Berlín
(1, 7)   ärlin
(1, 7)   ärlin
(0, 7)   Berlän
(16, 22)   ärlin
(1, 7)   ?erlin
(0, 7)   Beälin



apro_match('.*Berlin', l)
====================
(0, 7)   Berlín
(0, 7)   Bärlin
(0, 7)   Bärlin
(0, 7)   Berlän
(0, 22)   London, Paris, Bärlin
(0, 7)   äerlin
(0, 7)   Beälin

Note that for the regex '.*Berlin' it works fine, while for the regex 'Berlin'

u'Bärlin'.encode('utf-8')    
u'B\xe4rlin'.encode('utf-8')
u'äerlin'.encode('utf-8')

are not working, while

u'Berlín'.encode('utf-8')
u'Berlän'.encode('utf-8')
u'London, Paris, Bärlin'.encode('utf-8')
u'Beälin'.encode('utf-8')

work as expected.

Am I doing something wrong with the encoding? Do you know of any trick?


Solution

  • You could use the new regex library; it supports Unicode 6.0 and fuzzy matching:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    from itertools import ifilter, imap
    import regex as re
    
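    def apro_match(word_re, lines, fuzzy='e<=1'):
        # Wrap the pattern in a group and append the fuzzy constraint, e.g. (Berlin){e<=1}
        search = re.compile(ur'('+word_re+'){'+fuzzy+'}').search
        # Run the search on every line and keep only the successful matches
        for m in ifilter(None, imap(search, lines)):
            print m.span(), m[0]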
    
    def main():
        lst = u'Moskau Berlín Bärlin B\xe4rlin Berlän'.split()
        lst += [u'London, Paris, Bärlin']
        lst += u'äerlin Beälin'.split()
        print
        print "apro_match('Berlin', lst)"
        print "="*25
        apro_match('Berlin', lst)
        print 
        print "apro_match('.*Berlin', lst)"
        print "="*27
        apro_match('.*Berlin', lst)
    
    if __name__ == '__main__':
        main()
    

    'e<=1' means that at most one error of any kind is permitted. There are three types of errors:

    • Insertion, indicated by "i"
    • Deletion, indicated by "d"
    • Substitution, indicated by "s"
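
To make the three error types concrete, here is a minimal sketch that counts them with a plain edit-distance function. It is only an illustration: `edit_distance` is a hypothetical helper, it is written for Python 3 using the standard library only, and it compares whole strings rather than searching inside a line. It is not how the regex module works internally.

```python
def edit_distance(pattern, text):
    """Minimum number of insertions, deletions and substitutions
    needed to turn `pattern` into `text` (hypothetical helper)."""
    # prev[j] holds the distance between pattern[:i-1] and text[:j]
    prev = list(range(len(text) + 1))
    for i, p in enumerate(pattern, start=1):
        curr = [i]
        for j, t in enumerate(text, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion: p is missing from text
                curr[j - 1] + 1,          # insertion: t is extra in text
                prev[j - 1] + (p != t),   # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


if __name__ == '__main__':
    for candidate in ['Bärlin', 'Berliin', 'Berln', 'Bäälin']:
        print(candidate, edit_distance('Berlin', candidate))
```

With a limit of one error, 'Bärlin' (one substitution), 'Berliin' (one insertion) and 'Berln' (one deletion) would all be acceptable, while 'Bäälin' needs two substitutions and would not.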

    Output of main():

    apro_match('Berlin', lst)
    =========================
    (0, 6) Berlín
    (0, 6) Bärlin
    (0, 6) Bärlin
    (0, 6) Berlän
    (15, 21) Bärlin
    (0, 6) äerlin
    (0, 6) Beälin
    
    apro_match('.*Berlin', lst)
    ===========================
    (0, 6) Berlín
    (0, 6) Bärlin
    (0, 6) Bärlin
    (0, 6) Berlän
    (0, 21) London, Paris, Bärlin
    (0, 6) äerlin
    (0, 6) Beälin
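
The spans here end at 6 and 21, while the TRE output in the question ends at 7 and 22. That is consistent with TRE having been given UTF-8 byte strings, so its offsets count bytes, whereas regex was given unicode strings and counts characters. The sketch below shows the difference under that assumption. `char_span_to_byte_span` is a hypothetical helper, and the code is written for Python 3 with the standard library only.

```python
def char_span_to_byte_span(text, start, end, encoding='utf-8'):
    """Convert a (start, end) span in characters into the matching
    span in encoded bytes (hypothetical helper)."""
    byte_start = len(text[:start].encode(encoding))
    byte_end = len(text[:end].encode(encoding))
    return byte_start, byte_end


if __name__ == '__main__':
    # 'í' and 'ä' each take two bytes in UTF-8, so the byte spans are longer.
    print(char_span_to_byte_span('Berlín', 0, 6))
    print(char_span_to_byte_span('London, Paris, Bärlin', 15, 21))

    # Slicing a byte string in the middle of a two-byte character leaves
    # an invalid sequence, which is displayed as a replacement character.
    broken = 'äerlin'.encode('utf-8')[1:7]
    print(broken.decode('utf-8', errors='replace'))
```

The last case may explain the '?erlin' line in the question's output: a byte range starting at 1 begins in the middle of the two-byte 'ä'.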