I am currently working on a Python project that connects to an email server and looks at the latest email to tell the user whether it contains an attachment or an embedded link. I have the former working but not the latter.
I may be having trouble with the if any() part of my script, as it seems to only half work when I test it. It may also be due to how the email string is printed out.
Here is my code for connecting to Gmail and then looking for the link:
import imaplib
import email

word = ["http://", "https://", "www.", ".com", ".co.uk"] #list of strings to search for in email body

#connection to the email server
mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('email@gmail.com', 'password')
mail.list()
# Out: list of "folders" aka labels in gmail.
mail.select("Inbox", readonly=True) # connect to inbox.

result, data = mail.uid('search', None, "ALL") # search and return uids instead
ids = data[0] # data is a list.
id_list = ids.split() # ids is a space separated string
latest_email_uid = data[0].split()[-1]

result, data = mail.uid('fetch', latest_email_uid, '(RFC822)') # fetch the email headers and body (RFC822) for the given ID
raw_email = data[0][1] # here's the body, which is raw headers and html and body of the whole email
# including headers and alternate payloads

print "---------------------------------------------------------"
print "Are there links in the email?"
print "---------------------------------------------------------"

msg = email.message_from_string(raw_email)
for part in msg.walk():
    # each part is a either non-multipart, or another multipart message
    # that contains further parts... Message is organized like a tree
    if part.get_content_type() == 'text/plain':
        plain_text = part.get_payload()
        print plain_text # prints the raw text
        if any(word in plain_text for word in word):
            print '****'
            print 'found link in email body'
            print '****'
        else:
            print '****'
            print 'no link in email body'
            print '****'
So, as you can see, I have a variable called word which contains a list of keywords to search for in the plain-text email.
When I send a test email with an embedded link in the format 'http://' or 'https://', the script prints the email body with the link in the text, like this:
---------------------------------------------------------
Are there links in the email?
---------------------------------------------------------
Test Link <http://www.google.com/>
****
found link in email body
****
And I get my message saying 'found link in email body', which is the result I am looking for in my test phase; in the final program this will trigger something else.
However, if I add an embedded link with no http://, such as google.com, then the link isn't printed and I don't get the result, even though the email has an embedded link.
Is there a reason for this? I also suspect my if any() check is not the best approach. I didn't really understand it when I originally added it, but it worked for http:// links. Then I tried just a .com and ran into this problem, which I am having trouble solving.
To check if there are attachments to an e-mail you can search the headers for Content-Type and see if it says "multipart/*". E-mails with multipart content types may contain attachments.
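A multipart content type only says that attachments are possible, so a follow-up check on each part can help. The sketch below is a minimal illustration, assuming Python 3 and the standard email package; the function name list_attachments and the sample message are made up for the example. It walks the parts and keeps those whose Content-Disposition is "attachment".

```python
import email
from email.message import EmailMessage  # used only to build a sample message


def list_attachments(msg):
    """Return the filenames of all parts marked as attachments."""
    names = []
    for part in msg.walk():
        # Container parts (multipart/*) hold other parts, not file data.
        if part.get_content_maintype() == "multipart":
            continue
        # "Content-Disposition: attachment" marks a part as an attached file.
        if part.get_content_disposition() == "attachment":
            names.append(part.get_filename())
    return names


# Build a small sample message so the sketch runs without a mail server.
sample = EmailMessage()
sample["Subject"] = "Test"
sample.set_content("See the attached file.")
sample.add_attachment(b"hello", maintype="application",
                      subtype="octet-stream", filename="hello.bin")

parsed = email.message_from_bytes(sample.as_bytes())
print(list_attachments(parsed))
```

Inline images can also be sent with a Content-Disposition of "inline", so whether to count those as attachments is a choice for your program.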
To inspect the text for links, images, etc., you can try regular expressions; in my opinion this is probably your best option. With a regex (regular expression) you can find strings that match a given pattern. The pattern "<a[^>]+href=\"(.*?)\"[^>]*>(.*?)</a>", for example, should match every HTML link (anchor tag) in your email message, regardless of whether the href is a single word or a full URL. Note that anchor tags only appear in the HTML version of the body, so the pattern needs to be run against the text/html part rather than the text/plain part. I hope that helps!
Here's an example of how you can implement this in Python:
import re

# Sample HTML body containing one link.
text = "This is your e-mail body. It contains a link to <a href='http://www.google.com'>Google</a>."
# Match an anchor tag; capture its single-quoted href value and its link text.
link_pattern = re.compile("<a[^>]+href='(.*?)'[^>]*>(.*?)</a>")
search = link_pattern.search(text)
if search is not None:
    print("Link found! -> " + search.group(0))
else:
    print("No links were found.")
For the end user, the link will just appear as "Google", without www, let alone http(s). However, the source of the message will have the HTML wrapping it, so by inspecting the raw body of the message you can find all links.
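To make that concrete, here is a rough sketch of pulling every href out of the HTML part of a parsed message. It assumes Python 3; the names HREF_PATTERN and html_links and the sample message are made up for the example, and the regex is a loose approximation rather than a real HTML parser. It uses get_payload(decode=True) so that any quoted-printable or base64 transfer encoding is undone before searching, which may matter if the raw payload looks mangled when printed.

```python
import email
import re
from email.message import EmailMessage  # used only to build a sample message

# Accepts single- or double-quoted href values; a sketch, not a full HTML parser.
HREF_PATTERN = re.compile(r"""<a\b[^>]*\bhref=["'](.*?)["']""", re.IGNORECASE)


def html_links(msg):
    """Return every href target found in the text/html parts of msg."""
    links = []
    for part in msg.walk():
        if part.get_content_type() != "text/html":
            continue
        payload = part.get_payload(decode=True)  # bytes, transfer-decoded
        charset = part.get_content_charset() or "utf-8"
        html = payload.decode(charset, errors="replace")
        links.extend(HREF_PATTERN.findall(html))
    return links


# Sample message with a plain-text part and an HTML alternative.
sample = EmailMessage()
sample.set_content("Visit Google")
sample.add_alternative('<p>Visit <a href="google.com">Google</a> or '
                       "<a href='http://www.example.com/'>Example</a></p>",
                       subtype="html")

parsed = email.message_from_bytes(sample.as_bytes())
print(html_links(parsed))
```

Because the href is read from the tag itself, a link typed as google.com is found even though the visible text contains no http:// or www.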
My code is not perfect, but I hope it gives you a general direction. You can look up multiple patterns in your e-mail body text, for image occurrences, videos, etc. To learn regular expressions you'll need to do a little research; the Wikipedia article on regular expressions is one place to start.
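As a rough illustration of looking up several patterns at once, the sketch below keeps them in a dictionary and reuses any() for the yes/no check. It assumes Python 3; the pattern names, the short list of domain endings, and the functions scan and has_link are all made up for the example, and the regexes are deliberately loose.

```python
import re

# Hypothetical pattern set; extend or tighten it to suit your mail.
PATTERNS = {
    # Full URLs such as http://..., https://... or www....
    "url": re.compile(r"""(?:https?://|www\.)[^\s<>"']+""", re.IGNORECASE),
    # Bare domains such as google.com (only a few endings are listed here).
    "bare_domain": re.compile(
        r"\b[a-z0-9-]+(?:\.[a-z0-9-]+)*\.(?:com|co\.uk|org|net)\b",
        re.IGNORECASE),
    # Image sources inside HTML img tags.
    "image": re.compile(r"""<img\b[^>]*\bsrc=["'](.*?)["']""", re.IGNORECASE),
}


def scan(text):
    """Map each pattern name to the list of matches found in text."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


def has_link(text):
    """True if the text contains something that looks like a link."""
    # any() returns True as soon as one pattern matches.
    return any(PATTERNS[name].search(text) for name in ("url", "bare_domain"))


print(has_link("Test Link <http://www.google.com/>"))
print(has_link("see google.com for details"))
print(scan('<img src="cat.png">')["image"])
```

Compared with plain substring tests such as ".com" in text, the word boundaries in the bare_domain pattern avoid matching in the middle of ordinary words, at the cost of having to list the endings you care about.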