Search code examples
pythondjangomimeamazon-ses

Encoding error: in MIME file data via AWS SES


I am trying to retrieve attachments data like file format and name of file from MIME via aws SES. Unfortunately some time file name encoding is changed, like file name is "3_amrishmishra_Entry Level Resume - 02.pdf" and in MIME it appears as '=?UTF-8?Q?amrishmishra=5FEntry_Level_Resume_=E2=80=93_02=2Epdf?=', any way to get exact file name?

if email_message.is_multipart():
message = ''
if "apply" in receiver_email.split('@')[0].split('_')[0] and isinstance(int(receiver_email.split('@')[0].split('_')[1]), int):
    for part in email_message.walk():
        content_type = str(part.get_content_type()).lower()
        content_dispo = str(part.get('Content-Disposition')).lower()
        print(content_type, content_dispo)

        if 'text/plain' in content_type and "attachment" not in content_dispo:
            message = part.get_payload()


        if content_type in ['application/pdf', 'text/plain', 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'image/jpeg', 'image/jpg', 'image/png', 'image/gif'] and "attachment" in content_dispo:
            filename = part.get_filename()
            # open('/tmp/local' + filename, 'wb').write(part.get_payload(decode=True))
            # s3r.meta.client.upload_file('/tmp/local' + filename, bucket_to_upload, filename)

            data = {
                'base64_resume': part.get_payload(),
                'filename': filename,
            }
            data_list.append(data)
    try:
        api_data = {
            'email_data': email_data,
            'resumes_data': data_list
        }
        print(len(data_list))
        response = requests.post(url, data=json.dumps(api_data),
                                 headers={'content-type': 'application/json'})
        print(response.status_code, response.content)
    except Exception as e:
        print("error %s" % e)

Solution

  • This syntax '=?UTF-8?Q?...?=' is a MIME encoded word. It is used in MIME email when a header value includes non-ASCII characters (gory details in RFC 2047). Your attachment filename includes an "en dash" character, which is why it was sent with this encoding.

    The best way to handle it depends on which Python version you're using...

    Python 3

    Python 3's updated email.parser package can correctly decode RFC 2047 headers for you:

    # Python 3
    from email import message_from_bytes, policy
    
    raw_message_bytes = b"<< the MIME message you downloaded from SES >>"
    message = message_from_bytes(raw_message_bytes, policy=policy.default)
    for attachment in message.iter_attachments():
        # (EmailMessage.iter_attachments is new in Python 3)
        print(attachment.get_filename())
        # amrishmishra_Entry Level Resume – 02.pdf
    

    You must specifically request policy.default. If you don't, the parser will use a compat32 policy that replicates Python 2.7's buggy behavior—including not decoding RFC 2047. (Also, early Python 3 releases were still shaking out bugs in the new email package, so make sure you're on Python 3.5 or later.)

    Python 2

    If you're on Python 2, the best option is upgrading to Python 3.5 or later, if at all possible. Python 2's email parser has many bugs and limitations that were fixed with a massive rewrite in Python 3. (And the rewrite added handy new features like iter_attachments() shown above.)

    If you can't switch to Python 3, you can decode the RFC 2047 filename yourself using email.header.decode_header:

    # Python 2 (also works in Python 3, but you shouldn't need it there)
    from email.header import decode_header
    
    filename = '=?UTF-8?Q?amrishmishra=5FEntry_Level_Resume_=E2=80=93_02=2Epdf?='
    decode_header(filename)
    # [('amrishmishra_Entry Level Resume \xe2\x80\x93 02.pdf', 'utf-8')]
    
    (decoded_string, charset) = decode_header(filename)[0]
    decoded_string.decode(charset)
    # u'amrishmishra_Entry Level Resume – 02.pdf'
    

    But again, if you're trying to parse real-world email in Python 2.7, be aware that this is probably just the first of several problems you'll encounter.

    The django-anymail package I maintain includes a compatibility version of email.parser.BytesParser that tries to work around several (but not all) other bugs in Python 2.7 email parsing. You may be able to borrow that (internal) code for your purposes. (Or since you tagged your question Django, you might want to look into Anymail's normalized inbound email handling, which includes Amazon SES support.)