When email.message_from_bytes() is given input with unicode/emoji in the headers, working with the resulting header values raises unexpected TypeErrors. Is it possible to process the input (encoding, decoding, etc.) before passing it to message_from_bytes() to prevent these TypeErrors?
The overall goal is to get gyb.py to successfully clean and restore backups from gyb-generated .eml files, some of which contain unicode/emoji in the email headers. The unicode/emoji should also be preserved rather than mangled (as they are in the sample output).
Minimal reproduction:
import email
with open('./sample.eml', 'rb') as f:
    raw = f.read()  # named "raw" to avoid shadowing the built-in bytes type
message = email.message_from_bytes(raw)
# No unicode/emoji: works as expected:
print(message['to'])
print(len(message['to']))
# With unicode/emoji: unexpected TypeError:
print(message['from'])
print(len(message['from']))
sample.eml
To: recipient <to@example.com>
From:🔥sender🔥 <from@example.com>
output:
$ python check-message.py
recipient <to@example.com>
26
����sender���� <from@example.com>
Traceback (most recent call last):
  File "V:\gyb\jkm\check-message.py", line 10, in <module>
    print(len(message['from']))
          ^^^^^^^^^^^^^^^^^^^^
TypeError: object of type 'Header' has no len()
GitHub issues related to the larger gyb restore problem
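Call email.message_from_bytes with policy email.policy.SMTPUTF8. The default policy (email.policy.compat32) returns a header containing non-ASCII bytes as an email.header.Header object, which has no len(); the newer policies return header values that are str subclasses.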
import email, email.policy
with open('./sample.eml', 'rb') as f:
    message = email.message_from_bytes(f.read(), policy=email.policy.SMTPUTF8)
print(message['from'])
print(len(message['from']))
This successfully outputs:
🔥sender🔥 <from@example.com>
27
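To see the difference between the two policies without needing sample.eml on disk, here is a minimal sketch that parses the same two headers from an in-memory byte string. It assumes the headers are UTF-8 encoded; the RAW constant and the variable names are made up for illustration and are not part of gyb.py.

```python
import email
import email.header
import email.policy

# In-memory stand-in for sample.eml (same two headers as in the question).
# Assumption: the header bytes are UTF-8.
RAW = (
    "To: recipient <to@example.com>\n"
    "From:🔥sender🔥 <from@example.com>\n"
    "\n"
).encode("utf-8")

# Default policy (compat32): a header with 8-bit bytes comes back as an
# email.header.Header object, which is why len() fails in the question.
legacy = email.message_from_bytes(RAW)
print(type(legacy["from"]).__name__)

# SMTPUTF8 policy: header values are str subclasses, so len() works.
modern = email.message_from_bytes(RAW, policy=email.policy.SMTPUTF8)
sender = str(modern["from"])
print(sender, len(sender))

# Serializing the parsed message again keeps the emoji as UTF-8 bytes.
round_tripped = modern.as_bytes()
print("🔥sender🔥".encode("utf-8") in round_tripped)
```

The last check is the part that matters for a restore: the bytes written back out still contain the original UTF-8 sequence for the display name. Whether that is sufficient for gyb.py depends on how it serializes messages, which is not shown here.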