
Emails sent from mobile devices are strangely decoded using email lib


I'm using Python's imaplib and email modules to fetch a list of emails from a mail server over IMAP and process them afterwards. This is the snippet I'm using to fetch and decode the emails:

import imaplib
import email

# Connect to server
box = imaplib.IMAP4(CSMTP_SERVER)
box.login(CSMTP_USERNAME, CSMTP_PASSWORD)

# List inbox
box.select('INBOX')

# Retrieve email list ID's matching search patterns
# Return from search is this:
# ('OK', ['1 2 3 4 5 6 7 8 9 10 11 12 13 14'])
data = box.search(None, 'ALL')[1]
for num in data[0].split():

    # Retrieve message headers and body
    headers = email.message_from_string(box.fetch(num, '(RFC822)')[1][0][1])
    body = headers.get_payload()
    if not isinstance(body, str):
        body = headers.get_payload()[0].get_payload()

    print headers, body

This works like a charm when the email is sent from Hotmail or Gmail, but whenever an email is sent from, for example, the default Android mail app, the message looks like this:

=?utf-8?B?RndkOiBDYXBzaGFyZTogaW1wb3J0aW5nIGZyb20gUGhvdG9z?
U2VudCBmcm9tIG15IEhUQwoKLS0tLS0gRm9yd2FyZGVkIG1lc3NhZ2UgLS0tLS0KRnJvbTogIkFs
ZXhhbmRlciBBdnRhbnNraSIgPGFsZXhAYXZ0YW5za2kuY29tPgpUbzogIlBlam1hbiBNYWtoZmki
IDxwakBtYWtoZmkuY29tPgpTdWJqZWN0OiBDYXBzaGFyZTogaW1wb3J0aW5nIGZyb20gUGhvdG9z
CkRhdGU6IFdlZCwgU2VwIDEwLCAyMDE0IDk6MDYgUE0KCkhpIFBlam1hbiwKCkkgd2FzIHBsYXlp
bmcgd2l0aCBDYXBzaGFyZSB0b2RheSBhbmQgZm91bmQgc29tZXRoaW5nIG1pc3NpbmcuIEkgZ3Vl
c3MgeW91CmhhdmUgcGxhbnMgZm9yIGl0LCBidXQgaXQgZG9lc24ndCBodXJ0IHRvIG1lbnRpb24g
aXQsIGp1c3Qgb24gY2FzZS4uLgoKV2hlbiBpbXBvcnRpbmcgcGhvdG9zLCBJIGhhdmUgdGhlIG9w
dGlvbiB0byBlaXRoZXIgZ2V0IG9uZSBvZiB0aGUgaW1hZ2VzCnRoYXQgYXJlIGRvd25sb2FkZWQg
b24gbXkgcGhvbmUsIG9yIHRvIHRha2UgYSBuZXcgcGljdHVyZS92aWRlby4gV2hhdCdzCm1pc3Np
bmcgaXMgYWJpbGl0eSB0byBnZXQgcGhvdG9zIGZyb20gbXkwcyBJJ3ZlIHVzZWQgZG9uJ3Qg
Y2FyZSB3aGVyZSB0aGUgcGhvdG8gaXMgbG9jYXRlZCBhbmQgYWxsCnBpY3R1cmVzIGFyZSBlcXVh
bGx5IGFjY2Vzc2libGUgKG9yIG1heWJlIHRoaXMgYXBwbGllcyBvbmx5IHRvIEdvb2dsZQphcHBz
PykuCgpOb3QgaW1wb3J0YW50LCBubyBpZGVhIGlmIGl0IGlzIGp1c3QgYSBsaW5lIG9yIHR3byBm
aXggb3Igc29tZXRoaW5nIG1vcmUKY29tcGxpY2F0ZWQuCgpUYWtlIGNhcmUsCgotIEFsZXgsIGJl
dGEgdGVzdGVyLCBRQSB2b2x1bnRlZXIsIGFuZCBzZW5pb3IgcGVza3kgc3RpY2tsZXI=

I got this message when sending the email from my mobile device. I doubt the device itself is the cause; more likely, some mail clients do not build the email headers correctly according to RFC822. Either way, I need to fix this somehow and be able to retrieve every email.

I would appreciate some hints about how to handle this. Thanks in advance.


Solution

  • This is a MIME message. MIME is not specified in RFC 822, but in the newer RFCs 2045-2047.

    The vast majority of modern email uses MIME in some way, so you should definitely support it.

    Of particular relevance to this message is RFC 2047, which specifies Encoded-Word. There is a good overview on Wikipedia, which I'll partially transcribe:

    The form is: "=?charset?encoding?encoded text?=".

    encoding can be either "Q", denoting Q-encoding, which is similar to the quoted-printable encoding, or "B", denoting base64 encoding.

    So, for this particular message, you have UTF-8 text that is Base64-encoded (B). The encoded text starts right after B?, not on the second line.
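For comparison with a hand-written parser, the standard library's `email.header.decode_header` already understands both the "B" and "Q" forms. The sketch below is written for Python 3 and makes two assumptions: the helper name `decode_mime_words` is made up for illustration, and the sample string is the first line of the output shown in the question with the closing `?=` that RFC 2047 requires appended.

```python
from email.header import decode_header


def decode_mime_words(value):
    """Decode a header value that may contain RFC 2047 encoded-words."""
    parts = []
    for text, charset in decode_header(value):
        # Encoded parts come back as bytes plus a charset name;
        # a value with no encoded-words comes back as a plain str.
        if isinstance(text, bytes):
            text = text.decode(charset or "ascii", errors="replace")
        parts.append(text)
    return "".join(parts)


# Assumed sample: first line of the question's output, with "?=" appended.
subject = "=?utf-8?B?RndkOiBDYXBzaGFyZTogaW1wb3J0aW5nIGZyb20gUGhvdG9z?="
print(decode_mime_words(subject))  # Fwd: Capshare: importing from Photos
```

This suggests the first line of the output is the message's Subject header rather than part of the body text.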

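    Here's some simple Python code to handle all this:

    import base64

    if body.startswith("=?"):
        i1 = body.index("?")           # "?" that ends the "=?" prefix
        i2 = body.index("?", i1 + 1)   # "?" that ends the charset field
        i3 = i2 + 2                    # "?" that follows the encoding letter
        charset = body[i1 + 1:i2]      # e.g. "utf-8"
        # Only the "B" (base64) form is handled; the "Q" form is not handled here
        assert body[i2:i3] == "?B"
        body = base64.b64decode(body[i3 + 1:]).decode(charset)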
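As an alternative to slicing the string by hand, here is a minimal Python 3 sketch that lets the email package do the decoding. It assumes the Android message declared `Content-Transfer-Encoding: base64` for its body (the question does not show the headers, so this is a guess), and the raw message and the function name `extract_text` are invented here so the example is self-contained.

```python
import base64
import email
from email.header import decode_header, make_header

# Hypothetical raw message, built only so the sketch can run on its own.
raw = (
    b"Subject: =?utf-8?B?RndkOiBDYXBzaGFyZTogaW1wb3J0aW5nIGZyb20gUGhvdG9z?=\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    + base64.b64encode("Sent from my HTC\n".encode("utf-8")) + b"\r\n"
)


def extract_text(raw_bytes):
    """Return (subject, body) as decoded text for the first text/plain part."""
    msg = email.message_from_bytes(raw_bytes)
    # decode_header splits the encoded-words; make_header joins them as text.
    subject = str(make_header(decode_header(msg.get("Subject", ""))))
    # walk() yields the message itself and, for multipart mail, each sub-part.
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            # decode=True undoes base64 or quoted-printable transfer encoding.
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return subject, payload.decode(charset, errors="replace")
    return subject, ""


subject, body = extract_text(raw)
print(subject)
print(body)
```

With imaplib on Python 3, the bytes returned by `box.fetch(num, '(RFC822)')[1][0][1]` could be passed to a function like this in place of `raw`.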