python imap quoted-printable

How to get Email in UTF-8?


I am writing a Python script to fetch the mail that people send to my email address.

I am using the IMAPClient module, and I get the content of the e-mail, but it is formatted strangely: all my UTF-8 characters are encoded, like this:

No=C3=ABl

Here is my piece of code :

    email_message = email.message_from_bytes(message_data[b'RFC822'])
    print(email_message.get_payload(0))

I also tried passing decode=True to get_payload, but then it returns None.


Solution

  • First identify the email part you are interested in, then decode that part's content according to its own encoding. Each part may have a different transfer encoding and/or character set. If you're interested in the main body of the email, it is usually the first part, which could be HTML or plain text depending on the program that sent it (some user agents, like Gmail, include both forms).

    You can use the email module's walk() method (Message.walk(), which EmailMessage inherits) on your message object to see the various parts and their respective content types. The parts are separated from one another by a special "boundary" string (often random) that does not occur in the message body (to avoid ambiguity). It's easier to let the email module walk the parts for you, especially since parts can nest.
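    As a minimal sketch of walking the parts: the message below is a hypothetical two-part example made up for illustration, and list_parts is a helper name introduced here, not part of the email module. Note that walk() also yields the multipart container itself as its first item.

```python
import email

# Hypothetical two-part message, used only for illustration.
raw = (
    b'Subject: Demo\r\n'
    b'MIME-Version: 1.0\r\n'
    b'Content-Type: multipart/alternative; boundary="BOUND"\r\n'
    b'\r\n'
    b'--BOUND\r\n'
    b'Content-Type: text/plain; charset="UTF-8"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'No=C3=ABl\r\n'
    b'--BOUND\r\n'
    b'Content-Type: text/html; charset="UTF-8"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'<p>No=C3=ABl</p>\r\n'
    b'--BOUND--\r\n'
)

msg = email.message_from_bytes(raw)


def list_parts(message):
    """Return (content type, charset, transfer encoding) for every part."""
    return [
        (part.get_content_type(),
         part.get_content_charset(),
         part.get('Content-Transfer-Encoding'))
        for part in message.walk()  # walk() descends into nested parts too
    ]


for row in list_parts(msg):
    print(row)
```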

    The snippet of text that you show in your question appears to be quoted-printable encoded. You can find an example conversion from quoted-printable to utf-8 here: Change "Quoted-printable" encoding to "utf-8"
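    For the snippet from the question, a minimal sketch of that conversion with the standard quopri module could look like this. It assumes the underlying charset is UTF-8, which is what the =C3=AB sequence suggests; the variable names are illustrative.

```python
import quopri

# The quoted-printable text shown in the question.
sample = 'No=C3=ABl'

# Undo the quoted-printable transfer encoding; the result is raw bytes.
raw_bytes = quopri.decodestring(sample.encode('ascii'))

# Interpret those bytes as UTF-8 (assumed charset) to get a Python str.
text = raw_bytes.decode('utf-8')

print(text)  # prints: Noël
```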

    An example:

    Below is a mock raw message, representing the bytes from which the message object is parsed. In an email, each section/part (main body, attachments, etc.) can have a different content-type, charset, and transfer-encoding. Parts can embed sub-parts, but email messages commonly have just a flat structure. Parts that are attachments usually also carry a content-disposition value, which indicates a suggested filename for the file content.

    Subject: Woah
    From: "Sébastien" <[email protected]>
    To: Bob <[email protected]>
    Content-Type: multipart/alternative; boundary="000000000000690fec05765c6a66"
    
    --000000000000690fec05765c6a66
    Content-Type: text/plain; charset="UTF-8"
    Content-Transfer-Encoding: quoted-printable
    
    S=C3=A9bastien est un pr=C3=A9nom.
    
    --000000000000690fec05765c6a66
    Content-Type: text/html; charset="UTF-8"
    Content-Transfer-Encoding: quoted-printable
    
    <div dir=3D"ltr"><div dir=3D"ltr"><div dir=3D"ltr"><div dir=3D"ltr"><div di=
    r=3D"ltr"><div dir=3D"ltr"><div dir=3D"ltr"><div dir=3D"ltr"><div dir=3D"lt=
    r"><div dir=3D"ltr"><div dir=3D"ltr"><div dir=3D"ltr"><div dir=3D"ltr"><div=
    dir=3D"ltr"><div dir=3D"ltr"><div dir=3D"ltr">...
    
    ...
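    To see how the content-disposition and suggested filename show up on attachment parts, here is a small sketch. The message is built in code with made-up content (the subject, body, and report.csv are all hypothetical), and describe_parts is a helper name introduced for illustration.

```python
from email.message import EmailMessage

# Build a hypothetical message with one attachment, for illustration only.
msg = EmailMessage()
msg['Subject'] = 'Report'
msg.set_content('See attached.')
msg.add_attachment(
    b'col1,col2\n1,2\n',
    maintype='application',
    subtype='octet-stream',
    filename='report.csv',
)


def describe_parts(message):
    """Return (content type, disposition, filename) for every part."""
    return [
        (part.get_content_type(),
         part.get_content_disposition(),  # e.g. 'attachment', or None
         part.get_filename())             # suggested filename, or None
        for part in message.walk()
    ]


for row in describe_parts(msg):
    print(row)
```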
    

    Once you have selected the part of interest, use that part's encoding settings to convert the payload properly. First undo any transfer encoding (e.g. quoted-printable), then decode the resulting bytes according to the part's charset.
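    The email module can do both steps for a single part: get_payload(decode=True) undoes the transfer encoding and returns bytes, and get_content_charset() reports the charset to decode them with. On a multipart container, get_payload(decode=True) returns None, which is the likely cause of the None seen in the question. The sketch below uses a cut-down version of the mock message; first_text_plain is a hypothetical helper, and the UTF-8 fallback for a missing charset is an assumption.

```python
import email

# Cut-down version of the mock message, for illustration only.
raw = (
    b'Subject: Woah\r\n'
    b'MIME-Version: 1.0\r\n'
    b'Content-Type: multipart/alternative; boundary="BOUND"\r\n'
    b'\r\n'
    b'--BOUND\r\n'
    b'Content-Type: text/plain; charset="UTF-8"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'S=C3=A9bastien est un pr=C3=A9nom.\r\n'
    b'--BOUND\r\n'
    b'Content-Type: text/html; charset="UTF-8"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'<div dir=3D"ltr">S=C3=A9bastien</div>\r\n'
    b'--BOUND--\r\n'
)

msg = email.message_from_bytes(raw)


def first_text_plain(message):
    """Return the first text/plain part as a str, or None if there is none."""
    for part in message.walk():
        if part.get_content_type() == 'text/plain':
            # Step 1: undo the transfer encoding (quoted-printable, base64, ...).
            payload = part.get_payload(decode=True)
            # Step 2: decode the bytes with the part's own charset.
            # Falling back to UTF-8 when no charset is declared is an assumption.
            charset = part.get_content_charset() or 'utf-8'
            return payload.decode(charset, errors='replace')
    return None


body = first_text_plain(msg)
print(body)
```

    The result is an ordinary Python str; call body.encode('utf-8') if UTF-8 bytes are needed.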

    If the charset of the part you want is already UTF-8, all you have to do is undo the content-transfer-encoding (e.g. decode the quoted-printable sequences). However, if the part's charset is different, say Latin-1, you have to go from bytes to Unicode and then back from Unicode to UTF-8:

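    import quopri

    # mime_part_payload is the still-encoded payload of the selected part;
    # the Latin-1 sample value here is only for illustration
    mime_part_payload = b'No=EBl'

    # remove quoted-printable encoding (the result is bytes)
    unquoted = quopri.decodestring(mime_part_payload)

    # latin-1 in this case is the charset from the MIME part's Content-Type header
    tmp_unicode = unquoted.decode('latin-1')

    # encode to desired encoding
    u8 = tmp_unicode.encode('utf-8')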
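
    As an alternative sketch, parsing with policy=email.policy.default yields EmailMessage objects, whose get_body() and get_content() methods select the body part and apply both the transfer decoding and the charset decoding. The Latin-1 message below is a made-up example, and the preference order ('plain', 'html') is an assumption about which body you want.

```python
import email
from email import policy

# Hypothetical Latin-1 message, used only for illustration.
raw = (
    b'Subject: Demo\r\n'
    b'MIME-Version: 1.0\r\n'
    b'Content-Type: multipart/alternative; boundary="BOUND"\r\n'
    b'\r\n'
    b'--BOUND\r\n'
    b'Content-Type: text/plain; charset="ISO-8859-1"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'No=EBl\r\n'
    b'--BOUND\r\n'
    b'Content-Type: text/html; charset="ISO-8859-1"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'\r\n'
    b'<p>No=EBl</p>\r\n'
    b'--BOUND--\r\n'
)

# policy.default makes the parser return EmailMessage objects.
msg = email.message_from_bytes(raw, policy=policy.default)

# Pick the plain-text body if present, otherwise the HTML one.
body_part = msg.get_body(preferencelist=('plain', 'html'))

# get_content() undoes quoted-printable and decodes with the part's charset.
text = body_part.get_content()

# Re-encode as UTF-8 only if bytes are required downstream.
utf8_bytes = text.encode('utf-8')

print(text)
```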