I am trying to use the TRE library in Python to match misspelled input.
It is important that it handles UTF-8 encoded strings well.
An example:
The German capital's name is Berlin, but going by pronunciation, people might just as well write "Bärlin".
It works so far, but if a non-ASCII character is in the first or second position of the detected string, neither the range nor the detected string itself is correct.
# -*- coding: utf-8 -*-
import tre

def apro_match(word, l):
    # Allow at most three errors in a match.
    fz = tre.Fuzzyness(maxerr=3)
    pt = tre.compile(word)
    for i in l:
        m = pt.search(i, fz)
        if m:
            # Print the matched range and the matched text.
            print m.groups()[0], ' ', m[0]

if __name__ == '__main__':
    string1 = u'Berlín'.encode('utf-8')
    string2 = u'Bärlin'.encode('utf-8')
    string3 = u'B\xe4rlin'.encode('utf-8')
    string4 = u'Berlän'.encode('utf-8')
    string5 = u'London, Paris, Bärlin'.encode('utf-8')
    string6 = u'äerlin'.encode('utf-8')
    string7 = u'Beälin'.encode('utf-8')
    l = ['Moskau', string1, string2, string3, string4, string5, string6, string7]

    print '\n'*2
    print "apro_match('Berlin', l)"
    print "="*20
    apro_match('Berlin', l)

    print '\n'*2
    print "apro_match('.*Berlin', l)"
    print "="*20
    apro_match('.*Berlin', l)
Output:
apro_match('Berlin', l)
====================
(0, 7) Berlín
(1, 7) ärlin
(1, 7) ärlin
(0, 7) Berlän
(16, 22) ärlin
(1, 7) ?erlin
(0, 7) Beälin
apro_match('.*Berlin', l)
====================
(0, 7) Berlín
(0, 7) Bärlin
(0, 7) Bärlin
(0, 7) Berlän
(0, 22) London, Paris, Bärlin
(0, 7) äerlin
(0, 7) Beälin
Note that for the regex '.*Berlin' it works fine, while for the regex 'Berlin' the strings
u'Bärlin'.encode('utf-8')
u'B\xe4rlin'.encode('utf-8')
u'äerlin'.encode('utf-8')
do not work, while
u'Berlín'.encode('utf-8')
u'Berlän'.encode('utf-8')
u'London, Paris, Bärlin'.encode('utf-8')
u'Beälin'.encode('utf-8')
work as expected.
Am I doing something wrong with the encoding? Do you know any trick?
You could use the new regex library; it supports Unicode 6.0 and fuzzy matching:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from itertools import ifilter, imap
import regex as re

def apro_match(word_re, lines, fuzzy='e<=1'):
    # Wrap the pattern in a group and attach the fuzzy constraint, e.g. (Berlin){e<=1}.
    search = re.compile(ur'('+word_re+'){'+fuzzy+'}').search
    # Search every line and skip the lines without a match.
    for m in ifilter(None, imap(search, lines)):
        print m.span(), m[0]

def main():
    lst = u'Moskau Berlín Bärlin B\xe4rlin Berlän'.split()
    lst += [u'London, Paris, Bärlin']
    lst += u'äerlin Beälin'.split()

    print
    print "apro_match('Berlin', lst)"
    print "="*25
    apro_match('Berlin', lst)

    print
    print "apro_match('.*Berlin', lst)"
    print "="*27
    apro_match('.*Berlin', lst)

if __name__ == '__main__':
    main()
'e<=1' means that at most one error of any kind is permitted. There are three types of errors: insertion (indicated by "i"), deletion (indicated by "d"), and substitution (indicated by "s").
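To make the three error types (insertion, deletion, substitution) concrete, here is a minimal sketch in plain Python of the classic Levenshtein edit distance, which counts exactly those operations. It is only an illustration and not how the regex library implements fuzzy matching; the function name edit_distance is made up for this example. The sketch is written to run under both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b.

    Hypothetical helper for illustration only (plain Levenshtein distance).
    """
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print('substitution:', edit_distance(u'Berlin', u'Bärlin'))   # 1
print('insertion:   ', edit_distance(u'Berlin', u'Berlinn'))  # 1
print('deletion:    ', edit_distance(u'Berlin', u'Berln'))    # 1
print('unrelated:   ', edit_distance(u'Berlin', u'Moskau'))   # 6
```

The first three candidates are each one error away from 'Berlin', which is the kind of difference a constraint like 'e<=1' tolerates, while 'Moskau' is far outside that limit.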
Output of the regex-based script:
apro_match('Berlin', lst)
=========================
(0, 6) Berlín
(0, 6) Bärlin
(0, 6) Bärlin
(0, 6) Berlän
(15, 21) Bärlin
(0, 6) äerlin
(0, 6) Beälin
apro_match('.*Berlin', lst)
===========================
(0, 6) Berlín
(0, 6) Bärlin
(0, 6) Bärlin
(0, 6) Berlän
(0, 21) London, Paris, Bärlin
(0, 6) äerlin
(0, 6) Beälin
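Note that the spans differ between the two outputs. The TRE run in the question was fed UTF-8 encoded byte strings, and its ranges such as (0, 7) are consistent with byte offsets, whereas regex here operates on unicode strings and reports character offsets such as (0, 6). The sketch below uses only the standard library to show the difference. The helper byte_span_to_char_span is hypothetical (it belongs to neither library) and assumes both offsets fall on character boundaries. The last lines hint at where '?erlin' may come from: byte offset 1 of the encoded 'äerlin' lies in the middle of the two-byte 'ä'.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def byte_span_to_char_span(data, start, end, encoding='utf-8'):
    """Map a (start, end) byte span in an encoded string to character offsets.

    Hypothetical helper; raises UnicodeDecodeError if an offset
    falls inside a multi-byte character.
    """
    prefix = data[:start].decode(encoding)
    matched = data[start:end].decode(encoding)
    return len(prefix), len(prefix) + len(matched)


# 'ä' is one character but two bytes in UTF-8.
print(len(u'Bärlin'), len(u'Bärlin'.encode('utf-8')))  # 6 7

encoded = u'London, Paris, Bärlin'.encode('utf-8')
# The byte span (16, 22) from the TRE output covers 'ärlin', not 'Bärlin'.
print(byte_span_to_char_span(encoded, 16, 22))  # (16, 21)

# Slicing the encoded 'äerlin' at byte 1 cuts 'ä' in half; decoding the
# remainder leniently yields a replacement character followed by 'erlin'.
broken = u'äerlin'.encode('utf-8')[1:7].decode('utf-8', 'replace')
print(broken == u'\ufffderlin')  # True
```

Passing unicode strings to a library that understands them, as in the regex example, avoids this class of problem, because a match can then never start or end inside a character.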