Tags: python, python-2.7, email, unicode, mime

Properly decoding a MIME-encoded email attachment name to a unicode object


To keep it simple: I have the following raw string, which is a filename with Chinese characters:

=?utf-8?B?5L+d56iO5LuT5Y+R6LSn5pel5oqlMS4xOS0xLjIxLnhsc3g=?=

According to http://dogmamix.com/MimeHeadersDecoder/, the decoded version of it looks like this:

保税仓发货日报1.19-1.21.xlsx (which is right)

I am trying to decode this to get the following unicode string:

u'保税仓发货日报1.19-1.21.xlsx'

What I am doing is:

Step 1:

in_str = '=?utf-8?B?5L+d56iO5LuT5Y+R6LSn5pel5oqlMS4xOS0xLjIxLnhsc3g=?='
from email.header import decode_header
res = decode_header(in_str)

Then res is a list of tuples of the following form:

[('\xe4\xbf\x9d\xe7\xa8\x8e\xe4\xbb\x93\xe5\x8f\x91\xe8\xb4\xa7\xe6\x97\xa5\xe6\x8a\xa51.19-1.21.xlsx', 'utf-8')]

This raises a question: why is res[0][0] partly an escaped byte string and partly plain text ('1.19-1.21.xlsx' appears unescaped)? But let's carry on.

Step 2:

Let's decode this byte string from UTF-8, as I believe it is a UTF-8 encoded string (logical, right?):

filename = res[0][0].decode('utf-8')

I believe this should return the following unicode string:

u'保税仓发货日报1.19-1.21.xlsx'

But I get another escaped string instead (this time unicode):

u'\u4fdd\u7a0e\u4ed3\u53d1\u8d27\u65e5\u62a51.19-1.21.xlsx'

Which drives me nuts, as I believe I am doing stuff right.

BTW, yes, I have read the "Unicode HOWTO", but I still have no idea how to fix it.


Solution

  • Continuing your example, and using an IDE or terminal whose font and encoding can display the characters:

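    #!python2
    from email.header import decode_header

    in_str = '=?utf-8?B?5L+d56iO5LuT5Y+R6LSn5pel5oqlMS4xOS0xLjIxLnhsc3g=?='
    res = decode_header(in_str)  # list of (byte string, charset) pairs
    for data, enc in res:
        # enc is None for parts that were not encoded; treat those as ASCII
        print data.decode(enc or 'ascii')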
    

    Output:

    保税仓发货日报1.19-1.21.xlsx
    

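    Your code is already correct. In Python 2, the interactive prompt shows the repr() of a value, which escapes every non-ASCII character: '\xe4\xbf\x9d...' is the escaped view of the UTF-8 bytes (ASCII bytes such as '1.19-1.21.xlsx' are shown as-is), and u'\u4fdd\u7a0e...' is the escaped view of the unicode string you wanted. To see the characters themselves, decode the bytes and print the result.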
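
If the header can contain several parts, the loop above can be wrapped in a small function that joins them into one string. The following is only a sketch: `decode_mime_filename` is a hypothetical helper name, not part of the `email` package, and the fallback to ASCII for parts without a charset is an assumption. It is written to run on both Python 2 and Python 3, and it also checks that the escaped form from the question is the same string as the readable one.

```python
from __future__ import print_function

from email.header import decode_header


def decode_mime_filename(raw):
    """Hypothetical helper: decode a MIME encoded-word header into one text string."""
    parts = []
    for data, charset in decode_header(raw):
        if isinstance(data, bytes):
            # Encoded parts arrive as bytes; charset is None for unencoded
            # parts, which are assumed here to be plain ASCII.
            parts.append(data.decode(charset or 'ascii'))
        else:
            # Python 3 can return already-decoded text for unencoded headers.
            parts.append(data)
    return u''.join(parts)


in_str = '=?utf-8?B?5L+d56iO5LuT5Y+R6LSn5pel5oqlMS4xOS0xLjIxLnhsc3g=?='
filename = decode_mime_filename(in_str)

# The \u escapes from the question are just another spelling of the same
# unicode string, so this comparison holds.
same = filename == u'\u4fdd\u7a0e\u4ed3\u53d1\u8d27\u65e5\u62a51.19-1.21.xlsx'
print(same)
```

Calling `print(filename)` afterwards shows the Chinese characters, provided the terminal's encoding can represent them.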