
Decode Bi-Directional bytes (e.g., 'iso-8859-8-i' and 'iso-8859-8-e') in Python


I'm working on a project involving email headers and got stuck trying to decode header content encoded with bidirectional character sets such as 'iso-8859-8-i' and 'iso-8859-8-e'. RFC 1556 (https://www.rfc-editor.org/rfc/rfc1556.html) defines these encodings.

When I call Python's .decode method on the bytes, I get this exception:

LookupError: unknown encoding: iso-8859-8-i

Any help or advice on handling both -i and -e suffixes would be awesome! Thanks!


Solution

  • I don't think you can do this directly. Python does not support all of the ECMA escape sequences, shifts, and so on that are used, for example, to switch encodings or to use more bytes per character.

    For -i: just use the normal decoder (iso-8859-8) and hope that the display engine does the right thing (that is, applies the Unicode bidirectional algorithm).

    For -e: possibly the same, and you can replace the explicit directionality bytes with the corresponding Unicode code points (if your display engine handles them; otherwise you must use markup or some other method to pass explicit directionality to the display engine).

    PS: you could also look at the source code of existing email programs to see how this is done in practice, rather than relying on an interpretation of the standards.