Replacing characters from multiple character maps in Python

I haven't been able to find a solution to this problem. It's for a workaround in some bad platform code that I can't change. I want to render UTF-8 strings, but the platform crashes if it receives a character outside its supported character maps. In this case, I have a German navigation unit in Russia: Latin 2 (iso-8859-2) and Cyrillic (iso-8859-5) are supported, but the platform crashes on an Arabic character. So I want to filter out anything that is not German or Russian.

This code:

import codecs
import string

if __name__ == '__main__':
    s = u'Ivan Krsti\u0107\u0416'

    print s

    print s.encode ('iso-8859-1', 'replace')
    print s.encode ('iso-8859-5', 'replace').decode('iso-8859-5')
    print s.encode ('iso-8859-2', 'replace').decode('iso-8859-2')

Produces

Ivan KrstićЖ 
Ivan Krsti??
Ivan Krsti?Ж
Ivan Krstić?

My question is: how do I combine the character maps for 'iso-8859-2' and 'iso-8859-5' so that I get the first result after filtering? (Assume that I've already decoded the UTF-8 input to unicode.)

Solution

You can produce all codepoints that are valid for either codec using sets:

# xrange(0x100) covers every byte value, 0x00 through 0xff inclusive
iso_8859_2 = {chr(i).decode('iso-8859-2') for i in xrange(0x100)}
iso_8859_5 = {chr(i).decode('iso-8859-5') for i in xrange(0x100)}
combined = iso_8859_2 | iso_8859_5

and then make that into a regular expression:

import re
# escape meta characters (including backslash, ^, - and ]) for use in a character class
valid = re.escape(u''.join(combined))
invalid = re.compile(u'[^{}]'.format(valid))

and apply that to Unicode text to filter out all codepoints that fall outside that combined set:

text_using_only_iso_8859_2_or_5 = invalid.sub(u'', unicodetext)

This then removes any codepoints that are not in either of the given character sets.
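For readers on Python 3, the same idea can be sketched as follows. This is a minimal sketch, not a drop-in replacement for the Python 2 code above: the helper name `build_filter` and the sample string are made up for illustration, and it assumes `str` input (already decoded text). Note that the combined set also contains the control characters that both codecs map, so those are kept as well.

```python
import re


def build_filter(*encodings):
    """Compile a pattern that matches any character none of the codecs can represent.

    `build_filter` is a hypothetical helper name used only for this sketch.
    """
    allowed = set()
    for encoding in encodings:
        for byte in range(0x100):  # every byte value, 0x00..0xff
            try:
                allowed.add(bytes([byte]).decode(encoding))
            except UnicodeDecodeError:
                pass  # byte is undefined in this codec; skip it
    # re.escape takes care of '\\', '^', '-' and ']' inside the character class
    return re.compile('[^{}]'.format(re.escape(''.join(sorted(allowed)))))


unsupported = build_filter('iso-8859-2', 'iso-8859-5')

# Sample from the question with an Arabic letter (U+0633) appended
sample = 'Ivan Krsti\u0107\u0416\u0633'
cleaned = unsupported.sub('', sample)
```

Here `cleaned` keeps the Latin 2 and Cyrillic characters and drops only the Arabic one.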

You could also work with unicode.translate(), which takes a mapping of codepoints (integers) to new codepoints, or None to remove characters:

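# 0x110000 so that the last codepoint, U+10FFFF, is included
all_of_unicode = set(xrange(0x110000))
# xrange(0x100) covers every byte value, 0x00 through 0xff inclusive
iso_8859_2 = {ord(chr(i).decode('iso-8859-2')) for i in xrange(0x100)}
iso_8859_5 = {ord(chr(i).decode('iso-8859-5')) for i in xrange(0x100)}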
# map the difference to None values
to_remove = dict.fromkeys(all_of_unicode - iso_8859_2 - iso_8859_5)
text_using_only_iso_8859_2_or_5 = unicodetext.translate(to_remove)
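Building a dictionary with an entry for nearly every Unicode codepoint works but is fairly large. One possible alternative, shown here as a Python 3 sketch under the assumption that `str.translate()` is available, is a lazy mapping that decides per codepoint on first lookup. The class name `KeepOnly` and the variable names are hypothetical and not part of any library.

```python
class KeepOnly(dict):
    """Lazy translate table: codepoints in `allowed` are kept, all others removed.

    Hypothetical helper for this sketch; it relies on dict.__missing__ being
    called when str.translate() looks up a codepoint that is not cached yet.
    """

    def __init__(self, allowed):
        super().__init__()
        self.allowed = allowed

    def __missing__(self, codepoint):
        # Map a codepoint to itself to keep it, or to None to delete it
        result = codepoint if codepoint in self.allowed else None
        self[codepoint] = result  # cache the decision for later lookups
        return result


# Codepoints representable in either codec (every byte value 0x00..0xff)
allowed_codepoints = {
    ord(bytes([byte]).decode(encoding))
    for encoding in ('iso-8859-2', 'iso-8859-5')
    for byte in range(0x100)
}

table = KeepOnly(allowed_codepoints)

# Sample from the question with an Arabic letter (U+0633) appended
filtered = 'Ivan Krsti\u0107\u0416\u0633'.translate(table)
```

The table only ever stores the codepoints that actually occur in the translated text, so it stays small for typical input.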