python encode codec

In Python, replacing characters from multiple char maps


I haven't been able to find a solution to this problem. It's for a workaround in some bad platform code that I can't do anything about. I want to render UTF-8 strings, but the platform crashes if it receives a character outside its supported character maps. In this case, I have a German Navi unit in Russia: Latin 2 (iso-8859-2) and Cyrillic (iso-8859-5) are supported, but the platform crashes on an Arabic character. So I want to filter out anything that is not German or Russian.

This code:

# Python 2
if __name__ == '__main__':
    s = u'Ivan Krsti\u0107\u0416'

    print s

    # Characters missing from the target charset are replaced with '?'
    print s.encode('iso-8859-1', 'replace')
    print s.encode('iso-8859-5', 'replace').decode('iso-8859-5')
    print s.encode('iso-8859-2', 'replace').decode('iso-8859-2')

Produces

Ivan KrstićЖ 
Ivan Krsti??
Ivan Krsti?Ж
Ivan Krstić?

My question is: how do I combine the character maps for 'iso-8859-2' and 'iso-8859-5' so that I get the first result after filtering? (Assume that I've already decoded the UTF-8 input to unicode.)


Solution

  • You can produce all codepoints that are valid for either codec using sets:

    # xrange(0x100) covers every byte value 0x00-0xff, including 0xff itself
    iso_8859_2 = {chr(i).decode('iso-8859-2') for i in xrange(0x100)}
    iso_8859_5 = {chr(i).decode('iso-8859-5') for i in xrange(0x100)}
    combined = iso_8859_2 | iso_8859_5
    

    and then make that into a regular expression:

    import re
    # escape every metacharacter (backslash, ^, -, ]) so the character class stays valid
    valid = re.escape(u''.join(combined))
    invalid = re.compile(u'[^{}]'.format(valid))
    

    and apply that to Unicode text to strip every codepoint that falls outside the combined set:

    text_using_only_iso_8859_2_or_5 = invalid.sub('', unicodetext)
    

    This then removes any codepoints that are not in either of the given character sets.

    You could also work with unicode.translate(), which takes a mapping of codepoints (integers) to new codepoints, or to None to remove characters:

    # 0x110000 is one past the last codepoint (U+10FFFF); 0x100 is one past the last byte
    all_of_unicode = set(xrange(0x110000))
    iso_8859_2 = {ord(chr(i).decode('iso-8859-2')) for i in xrange(0x100)}
    iso_8859_5 = {ord(chr(i).decode('iso-8859-5')) for i in xrange(0x100)}
    # map the difference to None values
    to_remove = dict.fromkeys(all_of_unicode - iso_8859_2 - iso_8859_5)
    text_using_only_iso_8859_2_or_5 = unicodetext.translate(to_remove)
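
The snippets above are Python 2. On Python 3, where str is already Unicode, the same set-based idea could be sketched roughly as follows. This is only an illustration, not part of the original answer: the helper names build_allowed and keep_only are made up for this example, and it assumes the two single-byte codecs from the question.

```python
def build_allowed(codec_names):
    """Collect every character that at least one of the codecs can represent.

    Hypothetical helper: works only for single-byte codecs such as iso-8859-x.
    """
    allowed = set()
    for name in codec_names:
        for byte in range(0x100):  # every byte value, 0x00 through 0xff
            try:
                allowed.add(bytes([byte]).decode(name))
            except UnicodeDecodeError:
                pass  # this byte is undefined in this codec
    return allowed


def keep_only(text, allowed, replacement=''):
    """Replace every character not in `allowed` (hypothetical helper)."""
    return ''.join(ch if ch in allowed else replacement for ch in text)


ALLOWED = build_allowed(['iso-8859-2', 'iso-8859-5'])

# Latin-2 'ć', Cyrillic 'Ж', then an Arabic letter that neither codec supports
sample = 'Ivan Krsti\u0107\u0416\u0639'
print(keep_only(sample, ALLOWED))
```

Passing replacement='?' instead of the default would mirror what the 'replace' error handler does in the question, rather than dropping the unsupported characters.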