python email unicode emoji

How to get email.message_from_bytes to work with unicode input


When email.message_from_bytes() is given input with unicode/emoji in the headers, accessing those headers leads to unexpected TypeErrors. Is it possible to preprocess the input (encoding, decoding, etc.) before passing it to message_from_bytes() to prevent these TypeErrors?

The overall goal is to get gyb.py to successfully clean and restore backups from gyb-generated .eml files, some of which contain unicode/emoji in the email headers. The unicode/emoji should also be preserved rather than mangled the way they are in the sample output.

Minimal reproduction:

import email

# Read the raw message bytes; avoid shadowing the built-in name `bytes`.
with open('./sample.eml', 'rb') as f:
    raw = f.read()
message = email.message_from_bytes(raw)

# No unicode/emoji: works as expected:
print(message['to'])
print(len(message['to']))

# With unicode/emoji: unexpected TypeError:
print(message['from'])
print(len(message['from']))

sample.eml

To: recipient <to@example.com>
From:🔥sender🔥 <from@example.com>

output:

$ python check-message.py
recipient <to@example.com>
26
����sender���� <from@example.com>
Traceback (most recent call last):
  File "V:\gyb\jkm\check-message.py", line 10, in <module>
    print(len(message['from']))
          ^^^^^^^^^^^^^^^^^^^^
TypeError: object of type 'Header' has no len()

GitHub issues related to the larger gyb restore problem


Solution

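  • Call email.message_from_bytes with policy=email.policy.SMTPUTF8. Under the default policy (email.policy.compat32), a header containing non-ASCII bytes comes back as an email.header.Header object, which has no len(). Under SMTPUTF8 the header comes back as a str subclass with the UTF-8 text decoded.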

    import email, email.policy
    with open('./sample.eml', 'rb') as f:
        message = email.message_from_bytes(f.read(), policy=email.policy.SMTPUTF8)
    
        print(message['from'])
        print(len(message['from']))
    

    This successfully outputs:

    🔥sender🔥 <from@example.com>
    27
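
To compare the two policies without a file on disk, here is a minimal sketch that builds the same content as sample.eml in memory. The names RAW and from_header are hypothetical helpers introduced only for this illustration. The final check assumes that serializing with the SMTPUTF8 policy writes the header back as raw UTF-8 rather than as RFC 2047 encoded words, which is what the policy's utf8=True setting is documented to do.

```python
import email
import email.policy
from email.header import Header

# Same content as sample.eml, built in memory (hypothetical test data).
RAW = (
    "To: recipient <to@example.com>\n"
    "From:🔥sender🔥 <from@example.com>\n"
    "\n"
).encode("utf-8")


def from_header(raw, policy=None):
    """Parse raw bytes and return the From header under the given policy."""
    if policy is None:
        # No policy argument: message_from_bytes uses email.policy.compat32.
        message = email.message_from_bytes(raw)
    else:
        message = email.message_from_bytes(raw, policy=policy)
    return message["from"]


legacy = from_header(RAW)                            # default compat32 policy
modern = from_header(RAW, email.policy.SMTPUTF8)     # policy from the answer

print(isinstance(legacy, Header))   # Header object, hence no len()
print(isinstance(modern, str))      # str subclass, so len() works
print(len(modern))

# Round trip: parse and serialize again with the same policy, then check
# that the emoji bytes are still present in the output.
message = email.message_from_bytes(RAW, policy=email.policy.SMTPUTF8)
round_tripped = message.as_bytes()
print("🔥sender🔥".encode("utf-8") in round_tripped)
```

If the round-trip check holds in your Python version, passing the same policy at parse time should be enough for the restore case, because as_bytes() serializes with the message's own policy by default.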