
Decoding Hex when pulling Multilingual Email Data with imaplib in Python


How do I turn this:

With Best Regards, JS Chen*\r\n\r\n=E9=A0=8E=E9=82=A6=E7=A7=91=E6=8A=80=E8=82=A1=E4=BB=BD=E6=9C=89=E9=99=90=E5=\r\n=85=AC=E5=8F=B8/Chipbond Technology 

to this:

With Best Regards, JS Chen*\r\n\r\n頎邦科技股份有限公司/Chipbond Technology

using Python?

I'm pulling mixed-language email data using imaplib, and whenever a message contains non-English characters they come back as hex codes separated by equal signs.

Here is my code:

import email
import imaplib

mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('****@gmail.com', '*********')

mail.select('Inbox')

typ, data = mail.search(None, '(SUBJECT "ULVAC RSH-820")')
mail_ids = data[0]
id_list = mail_ids.split()
print('searching...')

for num in data[0].split():
    typ, data = mail.fetch(num, '(RFC822)' )
    raw_email = data[0][1]
    raw_email_string = raw_email.decode('utf-8')
    email_message = email.message_from_string(raw_email_string)
    print('decoding..')
    for response_part in data:
        if isinstance(response_part, tuple):
            msg = email.message_from_string(response_part[1].decode('utf-8')) 

            if msg.is_multipart():
                print('de-partitioning...')
                for part in msg.walk():
                    ctype = part.get_content_type()
                    cdispo = str(part.get('Content-Disposition'))
                    if ctype == 'text/plain' and 'attachment' not in cdispo:
                        body = part.get_payload()


            else:
                body = msg.get_payload()

Solution

  • Your emails use a content transfer encoding, specifically Quoted-Printable, which keeps the email data stream ASCII-safe: each non-ASCII byte is written as an equals sign followed by two hex digits.
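
To illustrate what that encoding looks like, here is a minimal sketch that decodes the fragment from the question by hand with the standard-library quopri module. It assumes the text is UTF-8, which the question does not state explicitly; in a real message the charset comes from the Content-Type header, and the email package handles this step for you.

```python
import quopri

# The Quoted-Printable fragment from the question. A trailing "=" followed by
# a line break is a soft line break and is dropped when decoding.
encoded = (
    b"=E9=A0=8E=E9=82=A6=E7=A7=91=E6=8A=80=E8=82=A1=E4=BB=BD=E6=9C=89=E9=99=90=E5=\r\n"
    b"=85=AC=E5=8F=B8/Chipbond Technology"
)

# Step 1: undo the transfer encoding, which yields raw bytes.
raw_bytes = quopri.decodestring(encoded)

# Step 2: decode the bytes with the message charset (assumed UTF-8 here).
text = raw_bytes.decode("utf-8")
print(text)
```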

    Simply tell Python to decode the payload by passing in decode=True to the Message.get_payload() method:

    body_data = part.get_payload(decode=True)
    charset = part.get_param("charset", "ASCII")
    body = body_data.decode(charset, errors="replace")
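
As a self-contained sketch of the snippet above, the following parses a small hand-written message instead of one fetched over IMAP. The raw_email bytes are made up for illustration; only the headers relevant to decoding are included.

```python
import email

# Hypothetical message, standing in for data[0][1] from mail.fetch().
raw_email = (
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"=E9=A0=8E=E9=82=A6=E7=A7=91=E6=8A=80/Chipbond Technology\r\n"
)

# Parsed with the default (legacy compat32) policy.
msg = email.message_from_bytes(raw_email)

# Without decode=True the payload is still Quoted-Printable text.
undecoded = msg.get_payload()

# With decode=True the transfer encoding is removed and bytes are returned,
# which then need decoding with the charset from the Content-Type header.
body_data = msg.get_payload(decode=True)
charset = msg.get_param("charset", "ASCII")
body = body_data.decode(charset, errors="replace")
print(body)
```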
    

    However, this means you are given binary data even for text content types, so you must decode it explicitly yourself. get_payload() is not that helpful here. It is also part of the legacy API; you want to switch to the newer, Unicode-friendly API. Do so by using a policy other than the compat32 policy (the default) when loading a message:

    from email import policy
    
    # ...
    
    raw_email = data[0][1]
    # you may have to use policy.default instead, depending on the line-ending
    # convention used.
    email_message = email.message_from_bytes(raw_email, policy=policy.SMTP)
    

    and further down:

    msg = email.message_from_bytes(response_part[1], policy=policy.SMTP)
    

    Note that I don't decode the bytes value first. By using email.message_from_bytes() instead of email.message_from_string(), you delegate decoding the data to the email parser.

    Now email_message is an email.message.EmailMessage instance instead of the older email.message.Message type, and you can use the EmailMessage.get_content() method, which for text MIME types returns a Unicode text string:

    body = part.get_content()
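
Putting the pieces together, here is a minimal sketch of the same part-walking loop as in the question, rewritten for the newer API. The raw_email bytes are a made-up multipart message standing in for what mail.fetch() returns, and the "BOUNDARY" marker is arbitrary.

```python
import email
from email import policy

# Hypothetical multipart message, standing in for data[0][1] from mail.fetch().
raw_email = (
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="BOUNDARY"\r\n'
    b"\r\n"
    b"--BOUNDARY\r\n"
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"=E9=A0=8E=E9=82=A6=E7=A7=91=E6=8A=80/Chipbond Technology\r\n"
    b"--BOUNDARY\r\n"
    b'Content-Type: text/html; charset="utf-8"\r\n'
    b"\r\n"
    b"<p>Chipbond Technology</p>\r\n"
    b"--BOUNDARY--\r\n"
)

# A non-compat32 policy makes the parser return EmailMessage objects.
msg = email.message_from_bytes(raw_email, policy=policy.SMTP)

body = None
for part in msg.walk():
    # Take the first text/plain part that is not an attachment.
    if (part.get_content_type() == "text/plain"
            and part.get_content_disposition() != "attachment"):
        # get_content() undoes the transfer encoding and the charset in one step.
        body = part.get_content()
        break

print(body)
```

For a non-multipart message, walk() yields only the message itself, so the same loop covers both branches of the original if/else.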