I haven't been able to find a solution to this problem, and it's for a workaround in some bad platform code that I can't do anything about. I want to render UTF-8 strings, but the platform crashes if it receives a character outside its supported character maps. In the case here, I have a German Navi unit in Russia: Latin 2 (iso-8859-2) and Cyrillic (iso-8859-5) are supported, but the platform crashes on an Arabic character. So I want to filter out anything that is not German or Russian.
This code:
import codecs
import string
if __name__ == '__main__':
    s = u'Ivan Krsti\u0107\u0416'
    print s
    print s.encode('iso-8859-1', 'replace')
    print s.encode('iso-8859-5', 'replace').decode('iso-8859-5')
    print s.encode('iso-8859-2', 'replace').decode('iso-8859-2')
Produces
Ivan KrstićЖ
Ivan Krsti??
Ivan Krsti?Ж
Ivan Krstić?
My question is: how do I combine the character maps for 'iso-8859-2' and 'iso-8859-5' so that I get the first result after filtering? (Assume that I've already decoded the UTF-8 to unicode.)
You can produce all codepoints that are valid for either codec using sets:
# xrange(0x100) covers every byte value, 0x00 through 0xff inclusive
iso_8859_2 = {chr(i).decode('iso-8859-2') for i in xrange(0x100)}
iso_8859_5 = {chr(i).decode('iso-8859-5') for i in xrange(0x100)}
combined = iso_8859_2 | iso_8859_5
and then make that into a regular expression:
import re
# escape every regex metacharacter, including the backslash
valid = re.escape(u''.join(combined))
# match any single codepoint that is outside the combined set
invalid = re.compile(u'[^{}]'.format(valid))
and apply that to the Unicode text:
text_using_only_iso_8859_2_or_5 = invalid.sub('', unicodetext)
This then removes any codepoints that are not in either of the given character sets.
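Put together, the regular expression approach could look like the sketch below. It is written to run on both Python 2.7 and Python 3, so it decodes a `bytearray` rather than using `chr()` and `xrange()`. The `charset` helper and the `sample` string are made up for illustration; U+0633 stands in for the Arabic character that crashes the platform.

```python
# -*- coding: utf-8 -*-
import re


def charset(codec):
    """Return the set of characters a single-byte codec can represent."""
    # Decode all 256 byte values; 'ignore' drops bytes the codec leaves undefined.
    return set(bytearray(range(256)).decode(codec, 'ignore'))


allowed = charset('iso-8859-2') | charset('iso-8859-5')

# re.escape protects every metacharacter inside the character class.
not_allowed = re.compile(u'[^{}]'.format(re.escape(u''.join(allowed))))

# Hypothetical input: the string from the question plus an Arabic letter.
sample = u'Ivan Krsti\u0107\u0416 \u0633'
filtered = not_allowed.sub(u'', sample)
```

Here `filtered` keeps both the Latin 2 and the Cyrillic character and loses only the Arabic one.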
You could also work with unicode.translate(), which takes a mapping of codepoints (integers) to new codepoints, or None to remove characters:
# 0x110000 is one past the last codepoint, U+10FFFF
all_of_unicode = set(xrange(0x110000))
iso_8859_2 = {ord(chr(i).decode('iso-8859-2')) for i in xrange(0x100)}
iso_8859_5 = {ord(chr(i).decode('iso-8859-5')) for i in xrange(0x100)}
# map the difference to None values
to_remove = dict.fromkeys(all_of_unicode - iso_8859_2 - iso_8859_5)
text_using_only_iso_8859_2_or_5 = unicodetext.translate(to_remove)
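A table covering all of Unicode has over a million entries. As a lighter variation (my own sketch, not something the platform or the standard library provides), you can build the mapping only from the characters that actually occur in the text. The `keep_only` and `charset` names are hypothetical, and the code assumes `text` is a unicode object on Python 2 or a `str` on Python 3.

```python
# -*- coding: utf-8 -*-


def charset(codec):
    """Return the set of characters a single-byte codec can represent."""
    # Decode all 256 byte values; 'ignore' drops bytes the codec leaves undefined.
    return set(bytearray(range(256)).decode(codec, 'ignore'))


def keep_only(text, codecs):
    """Remove every character of text that none of the given codecs can encode."""
    allowed = set().union(*(charset(name) for name in codecs))
    # Map only the offending codepoints found in this text to None (= delete).
    to_remove = dict.fromkeys(ord(ch) for ch in set(text) - allowed)
    return text.translate(to_remove)


# Hypothetical input: the string from the question plus an Arabic letter.
result = keep_only(u'Ivan Krsti\u0107\u0416\u0633', ['iso-8859-2', 'iso-8859-5'])
```

Because the list of codecs is a parameter, supporting another character map on a different unit only means passing a different list.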