python unicode encoding utf-8 iso-8859-15

How to normalize Unicode characters for ISO-8859-15 conversion in Python?


I want to convert unicode strings to iso-8859-15. These strings include the u"\u2019" character (RIGHT SINGLE QUOTATION MARK, see http://www.fileformat.info/info/unicode/char/2019/index.htm), which is not part of the iso-8859-15 character set.

In Python, how can I normalize the unicode characters so that they can be encoded as iso-8859-15?

I have looked at the unicodedata module without success. I managed to do the job with

s.replace(u"\u2019", "'").encode('iso-8859-15')

but I would like to find a more general and cleaner way.

Thanks for your help


Solution

  • Use the unicode version of the translate function, assuming s is a unicode string:

    s.translate({ord(u"\u2019"):ord(u"'")})
    

    The argument of the unicode version of translate is a dict mapping unicode ordinals to unicode ordinals. Add to this dict other characters you cannot encode in your target encoding.

    You can write your mapping table in a slightly more readable form and build the mapping dict from it, for instance:

    char_mappings = [(u"\u2019", u"'"),
                     (u"`", u"'")]
    translate_mapping = {ord(k):ord(v) for k,v in char_mappings}
    

    From the translate documentation:

    For Unicode objects, the translate() method does not accept the optional deletechars argument. Instead, it returns a copy of the s where all characters have been mapped through the given translation table which must be a mapping of Unicode ordinals to Unicode ordinals, Unicode strings or None. Unmapped characters are left untouched. Characters mapped to None are deleted. Note, a more flexible approach is to create a custom character mapping codec using the codecs module (see encodings.cp1251 for an example).
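
Putting the pieces together, here is a minimal sketch of the whole conversion, translating first and then encoding. The extra entries in the table and the helper name to_latin9 are only illustrative assumptions, not part of the original answer; adapt the table to the characters your data really contains. It relies on the documented behaviour that a mapping value may be a whole string, and it passes an error handler to encode() so that characters missing from the table do not raise.

```python
# Illustrative table (an assumption): extend it with the characters
# that actually occur in your data.
CHAR_MAPPINGS = [
    (u"\u2019", u"'"),    # RIGHT SINGLE QUOTATION MARK
    (u"\u2018", u"'"),    # LEFT SINGLE QUOTATION MARK
    (u"\u201c", u'"'),    # LEFT DOUBLE QUOTATION MARK
    (u"\u201d", u'"'),    # RIGHT DOUBLE QUOTATION MARK
    (u"\u2026", u"..."),  # HORIZONTAL ELLIPSIS, replaced by three characters
]

# Keys must be ordinals; values may be ordinals, strings or None.
TRANSLATE_MAPPING = {ord(src): dst for src, dst in CHAR_MAPPINGS}


def to_latin9(text, errors="replace"):
    """Hypothetical helper: map known characters, then encode to iso-8859-15.

    With errors="replace", anything still not encodable becomes "?".
    Pass errors="strict" to get a UnicodeEncodeError instead.
    """
    return text.translate(TRANSLATE_MAPPING).encode("iso-8859-15", errors)


encoded = to_latin9(u"It\u2019s 5\u20ac\u2026")
fallback = to_latin9(u"\u4e2d")
```

Here encoded is b"It's 5\xa4..." (the euro sign is part of iso-8859-15, so it passes through untouched) and fallback is b"?". Keeping errors="strict" while you develop is probably the easier way to find out which characters are still missing from the table.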