
Some links are missing in an email fetched with IMAP


I'm extracting the links from the email using imaplib and email, but the result is missing the main link, although the others are there.

#Assume that I know the id of an email that I need to parse '599'
typ, email_data = mail.fetch('599', '(RFC822)')

msg = email.message_from_bytes(email_data[0][1])
print(msg.get_payload()[0].get_payload())

Here's my email with three links:

[Screenshot of the email in Gmail]

This is the result:

Today's highlights

Web API in=C2=A0.Net 6.0 with Auth0 with Roles and Permissions

This week, I was tutoring a student client of mine. We have been working ou= r way through using Auth0. It=E2=80=A6

Jay (https://medium.com/@second-link) in ProjectWT (https://medium.com/@third-link) =C2=B73 min read

Links two and three are absolutely identical to those in the email, but as you can see the first link is missing (also in all similar cases) and I can't understand why. Any help would be appreciated.

Adding the default policy is not helping.

message = email.message_from_bytes(msg_as_bytes, policy=policy.default)

Solution

  • The immediate problem is most likely that you are extracting links from a MIME part which contains only two of the three links. The structure of the message is apparently something like

    -+ multipart/alternative
     -- text/plain
     -+ multipart/related
      -- text/html
      -- image/png
      -- image/png
    

    where your screenshot shows the text/html part with its related images, but your text excerpt shows the first part, text/plain, and your link extraction targets that part, too.
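    To check what structure a message really has, you can print its MIME tree. This is a minimal sketch, not part of the original question: the RAW bytes are a made-up stand-in for email_data[0][1], and describe_structure is a hypothetical helper name.

```python
import email
from email import policy

# Hypothetical stand-in for the bytes returned by the IMAP fetch (email_data[0][1]).
RAW = b"""\
From: sender@example.com
To: you@example.com
Subject: Structure demo
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="outer"

--outer
Content-Type: text/plain; charset=utf-8

Plain text rendering
--outer
Content-Type: multipart/related; boundary="inner"

--inner
Content-Type: text/html; charset=utf-8

<p><a href="https://example.com/main">Main link</a></p>
--inner
Content-Type: image/png
Content-Transfer-Encoding: base64

aGVsbG8=
--inner--
--outer--
"""


def describe_structure(part, depth=0):
    """Return one indented line per MIME part, outermost part first."""
    lines = ["  " * depth + part.get_content_type()]
    if part.is_multipart():
        for subpart in part.iter_parts():
            lines.extend(describe_structure(subpart, depth + 1))
    return lines


msg = email.message_from_bytes(RAW, policy=policy.default)
structure = describe_structure(msg)
print("\n".join(structure))
```

    If the first child of the top-level part is text/plain, then msg.get_payload()[0] is the plain-text rendering, not the HTML you see in your mail client.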

    In the general case, if you are processing a collection of messages from multiple senders using multiple email clients and sending multiple types of messages (some with embedded images, others perhaps with a PDF attachment or a collection of CSV files), you will need to analyze each individual message's structure and decide which MIME part(s) to extract based on the results. But for the common case where the message's top-level structure is either just a single body part or a multipart/alternative with a text/plain and a text/html rendering of the same "main" message (in any order), recent versions of Python offer a simple method which attempts to "do the right thing".

    As an aside, the email module in the standard library was overhauled in Python 3.6 to be more logical, versatile, and succinct; new code should target the (no longer very) new EmailMessage API. When you supply a policy argument such as email.policy.default to message_from_bytes, this is what you get. Without it, you get the legacy email.message.Message API, also called "compat32" because it's compatible back to Python 3.2 and earlier. (The new API was informally introduced in Python 3.3, though it only became the preferred and official API in 3.6.)
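    A quick way to see which API you ended up with is to look at the class of the parsed object. This is only an illustrative sketch; the RAW message here is a made-up one-liner.

```python
import email
from email import policy
from email.message import EmailMessage, Message

# Made-up minimal message, just enough to parse.
RAW = b"Subject: demo\r\n\r\nbody\r\n"

legacy = email.message_from_bytes(RAW)                         # no policy: compat32
modern = email.message_from_bytes(RAW, policy=policy.default)  # new API

print(type(legacy).__name__)  # Message
print(type(modern).__name__)  # EmailMessage

# get_body() and get_content() only exist on the new-style object.
print(hasattr(legacy, "get_body"), hasattr(modern, "get_body"))
```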

    With that, the following code should hopefully do what you want.

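    from email.policy import default
    msg = email.message_from_bytes(email_data[0][1], policy=default)
    # Prefer text/html, fall back to text/plain; get_content() returns decoded text.
    print(msg.get_body(preferencelist=('html', 'plain')).get_content())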
    

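    With the new API, calling get_content() on the extracted body part decodes its content transfer encoding and character set for you, so you don't need to request decoding separately. The lack of decoding was another problem with your original attempt: the =C2=A0 and =E2=80=A6 sequences in your output are undecoded quoted-printable.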

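    get_body() (which did not exist at all in the legacy API) lets you specify an ordered list of preferred MIME types. The default preference list is ('related', 'html', 'plain'), so for a message structured like the one above, it returns the multipart/related container rather than the HTML text itself. Passing preferencelist=('html', 'plain') selects the text/html part if one is available, and otherwise falls back to plain text.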
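    Putting the pieces together, here is a sketch of extracting the links from the preferred body part. The RAW message, the example.com URLs, and the LinkCollector class are all made up for illustration; in your code, RAW would be email_data[0][1].

```python
import email
from email import policy
from html.parser import HTMLParser

# Hypothetical stand-in for email_data[0][1]; the plain part lacks the main link.
RAW = b"""\
From: sender@example.com
To: you@example.com
Subject: Link demo
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="outer"

--outer
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

Web API in=C2=A0.Net 6.0
Jay (https://example.com/second)
--outer
Content-Type: multipart/related; boundary="inner"

--inner
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: quoted-printable

<p><a href=3D"https://example.com/main">Web API in=C2=A0.Net 6.0</a></p>
<p><a href=3D"https://example.com/second">Jay</a></p>
--inner
Content-Type: image/png
Content-Transfer-Encoding: base64

aGVsbG8=
--inner--
--outer--
"""


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


msg = email.message_from_bytes(RAW, policy=policy.default)

# With the default preference list, the multipart/related container wins.
default_choice = msg.get_body()

# Asking for html/plain only yields the text/html part itself.
html_part = msg.get_body(preferencelist=("html", "plain"))
html = html_part.get_content()  # quoted-printable and charset already decoded

collector = LinkCollector()
collector.feed(html)
links = collector.links
print(links)
```

    If the message has no HTML part at all, get_body() returns the text/plain part instead, and you would need a different way to find URLs in it (a regular expression, for example).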

    For testing, here is a quick and dirty example message with the assumed structure. If you need more help, please post a new question with a sample message (ideally pared down to just the essentials, and without the IMAP code, which isn't relevant for this particular problem).

    From: tripleee <[email protected]>
    To: you <[email protected]>
    Subject: Simple multipart example
    MIME-Version: 1.0
    Content-type: multipart/alternative; boundary="snowden-risen-woodward-manning"
    
    --snowden-risen-woodward-manning
    Content-type: text/plain; charset=utf-8
    Content-transfer-encoding: quoted-printable
    
    Today's highlights
    
    Web API in=C2=A0.Net 6.0 with Auth0 with Roles and Permissions=
    
    --snowden-risen-woodward-manning
    Content-type: multipart/related; boundary="pol-pot-stalin-trump-mao"
    
    --pol-pot-stalin-trump-mao
    Content-type: text/html; charset=utf-8
    Content-transfer-encoding: quoted-printable
    
    <h1>Today's highlights</h1>
    
    <p><a href=3D"https://example.com/spam">=
    Web API in=C2=A0.Net 6.0 with Auth0 with Roles and Permissions=
    </a></p>
    <img src=3D"cid:[email protected]"/>
    <img src=3D"cid:[email protected]"/>
    
    --pol-pot-stalin-trump-mao
    Content-type: image/png
    Content-transfer-encoding: base64
    Content-id: <[email protected]>
    
    somebase64gobbledygook=
    --pol-pot-stalin-trump-mao
    Content-type: image/png
    Content-transfer-encoding: base64
    Content-id: <[email protected]>
    
    morebase64gobbledygook=
    --pol-pot-stalin-trump-mao--
    --snowden-risen-woodward-manning--
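    If you would rather generate a test message than type one by hand, the EmailMessage API can build the same structure for you. This is a sketch under assumptions: the addresses, the Content-ID, and the image bytes are placeholders, and the image is not a real PNG.

```python
import email
from email import policy
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "sender@example.com"
msg["To"] = "you@example.com"
msg["Subject"] = "Simple multipart example"

# The first call creates the text/plain body.
msg.set_content(
    "Today's highlights\n\n"
    "Web API in\xa0.Net 6.0 with Auth0 with Roles and Permissions\n"
)

# Adding an alternative turns the message into multipart/alternative.
msg.add_alternative(
    "<h1>Today's highlights</h1>\n"
    '<p><a href="https://example.com/spam">'
    "Web API in\xa0.Net 6.0 with Auth0 with Roles and Permissions</a></p>\n"
    '<img src="cid:image1@example.com"/>\n',
    subtype="html",
)

# Attaching a related image wraps the HTML part in multipart/related.
html_part = msg.get_payload()[1]
html_part.add_related(
    b"placeholder bytes, not a real PNG",
    maintype="image",
    subtype="png",
    cid="<image1@example.com>",
)

# Round-trip through bytes, the same way the IMAP data would be parsed.
raw = msg.as_bytes()
parsed = email.message_from_bytes(raw, policy=policy.default)
content_types = [part.get_content_type() for part in parsed.walk()]
print(content_types)
```

    The resulting raw bytes can stand in for email_data[0][1] when you test your parsing code without a mail server.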