I have already decoded a lot of email attachment filenames in my code.
But this particular filename breaks my code.
Here is a minimal example:
from email.header import decode_header
encoded_filename='=?UTF-8?B?U2FsZXNJbnZvaWNl?==?UTF-8?B?LVJlcG9ydC5wZGY=?='
decoded_header=decode_header(encoded_filename) # --> [('SalesInvoiceQ1|\x04\xb5I\x95\xc1\xbd\xc9\xd0\xb9\xc1\x91\x98', 'utf-8')]
filename=str(decoded_header[0][0]).decode(decoded_header[0][1])
Exception:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb5 in position 16: invalid start byte
Don't ask me how, but Thunderbird is able to decode this filename to: SalesInvoice-Report.pdf
How can I decode this with Python the way email clients apparently can?
There are two encoded-word sections in that header. You'd have to detect where one ends and the next begins:
>>> print decode_header(encoded_filename[:28])[0]
('SalesInvoice', 'utf-8')
>>> print decode_header(encoded_filename[28:])[0]
('-Report.pdf', 'utf-8')
Apparently that's what Thunderbird does in this case: it splits the string into =?charset?encoding?data?= chunks. Normally these should be separated by whitespace, typically a folded line break, i.e. \r\n (CARRIAGE RETURN + LINE FEED) characters, but in your case they are mashed together. If you re-introduce the \r\n separator, the value decodes correctly:
>>> decode_header(encoded_filename[:28] + '\r\n' + encoded_filename[28:])[0]
('SalesInvoice-Report.pdf', 'utf-8')
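To make the chunk structure concrete, here is a minimal sketch that decodes each encoded word by hand instead of going through decode_header. It assumes Python 3 and a value made up only of encoded words; any text outside them is silently dropped. The names ENCODED_WORD and decode_words are invented for this illustration and are not part of the email package.

```python
import base64
import quopri
import re

# One encoded word: =?charset?B-or-Q?payload?=  (hypothetical helper pattern)
ENCODED_WORD = re.compile(r'=\?([^?]+)\?([BbQq])\?([^?]*)\?=')


def decode_words(value):
    """Decode every encoded word found in value and concatenate the results."""
    pieces = []
    for charset, encoding, payload in ENCODED_WORD.findall(value):
        if encoding.upper() == 'B':
            # "B" payloads are plain base64.
            raw = base64.b64decode(payload)
        else:
            # "Q" payloads are quoted-printable; header=True maps "_" to a space.
            raw = quopri.decodestring(payload.encode('ascii'), header=True)
        pieces.append(raw.decode(charset))
    return ''.join(pieces)


name = decode_words('=?UTF-8?B?U2FsZXNJbnZvaWNl?==?UTF-8?B?LVJlcG9ydC5wZGY=?=')
print(name)
```

Because each word is decoded on its own, this sketch would fail on a sender that splits a multi-byte character across two encoded words; joining the raw bytes before decoding would be one way around that.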
You could use a regular expression to extract the parts and re-introduce the separator:
import re
from email.header import decode_header

# Matches a single encoded word: =?charset?Q-or-B?payload?=
quopri_entry = re.compile(r'=\?[\w-]+\?[QB]\?[^?]+?\?=')

def decode_multiple(encoded, _pattern=quopri_entry):
    # Put every encoded word on its own line so decode_header() can tell them apart.
    fixed = '\r\n'.join(_pattern.findall(encoded))
    output = [b.decode(c) for b, c in decode_header(fixed)]
    return ''.join(output)
Demo:
>>> encoded_filename = '=?UTF-8?B?U2FsZXNJbnZvaWNl?==?UTF-8?B?LVJlcG9ydC5wZGY=?='
>>> decode_multiple(encoded_filename)
u'SalesInvoice-Report.pdf'
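The snippets above are Python 2 (print statement, u'' result). On Python 3, decode_header returns bytes for encoded parts and can return a str with a charset of None when nothing is encoded, so the list comprehension needs a small adjustment. The following is a sketch under that assumption; decode_multiple_py3 and ENCODED_WORD are hypothetical names. Recent Python 3 releases also seem to cope with adjacent encoded words on their own, in which case the join is harmless but possibly unnecessary.

```python
import re
from email.header import decode_header

# Same idea as the pattern above, but accepting lower-case "q"/"b" as well.
ENCODED_WORD = re.compile(r'=\?[\w-]+\?[QqBb]\?[^?]*\?=')


def decode_multiple_py3(encoded):
    """Re-separate mashed encoded words with CRLF, then decode them."""
    fixed = '\r\n'.join(ENCODED_WORD.findall(encoded))
    parts = []
    for data, charset in decode_header(fixed):
        if isinstance(data, bytes):
            # Encoded parts arrive as bytes plus the declared charset.
            data = data.decode(charset or 'ascii')
        parts.append(data)
    return ''.join(parts)


print(decode_multiple_py3('=?UTF-8?B?U2FsZXNJbnZvaWNl?==?UTF-8?B?LVJlcG9ydC5wZGY=?='))
```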
Of course, it could be that you have a bug in how you read the header in the first place. Make sure you don't accidentally destroy an existing \r\n separator when extracting the encoded_filename value.
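As an illustration of how such a bug could arise, the sketch below starts from a folded value (an assumed example, not taken from your actual message) and shows that deleting the fold produces exactly the mashed string from the question, while the folded form decodes without any extra work. It is written for Python 3; decode_value is a hypothetical helper.

```python
from email.header import decode_header

# A filename as it might look when folded across two lines (assumed example).
folded = '=?UTF-8?B?U2FsZXNJbnZvaWNl?=\r\n =?UTF-8?B?LVJlcG9ydC5wZGY=?='

# Hypothetical over-eager cleanup that deletes the fold instead of keeping it.
mashed = folded.replace('\r\n ', '')


def decode_value(value):
    """Decode a header value with decode_header and join the parts."""
    parts = []
    for data, charset in decode_header(value):
        if isinstance(data, bytes):
            data = data.decode(charset or 'ascii')
        parts.append(data)
    return ''.join(parts)


print(decode_value(folded))  # the separator is intact, so this decodes cleanly
print(mashed)                # =?UTF-8?B?U2FsZXNJbnZvaWNl?==?UTF-8?B?LVJlcG9ydC5wZGY=?=
```

If your extraction code contains a step like the replace() call here, that is the place to look first.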