python, imap, screen-scraping

Using IMAP to get urls in an email not working correctly


I'm trying to find specific URLs in an email: I want to get every URL that contains a given string. Here is my code:

import imaplib
import regex as re

def find_urls(string):
    regex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
    url = re.findall(regex,string)
    return([x[0] for x in url])

def save_matching_urls(username, password, sender, url_string):
    print("connecting to email, please wait...")
    con = imaplib.IMAP4_SSL("imap.gmail.com")
    con.login(username, password)
    con.select('INBOX')
    print("connected successfully, scraping email from " + sender)
    (_, data) = con.search(None, '(FROM {0})'.format(sender.strip()))
    ids = data[0].split()
    print(str(len(ids)) +" emails found")

    list_urls = []
    list_good_urls = []
    for mail in ids:
        result, data = con.fetch(mail, '(RFC822)') # fetch the email headers and body (RFC822) for the given ID
        raw_email = data[0][1]
        email = raw_email.decode("utf-8").replace("\r", '').replace("\t", '').replace(" ", "").replace("\n", "")
        list_url = find_urls(email)
        for url in list_url:
            if url_string in url:
                list_good_urls.append(url)

    print(str(len(list_good_urls)) + " urls found, saving...")
    with open("{}_urls.txt".format(sender), mode="a", encoding="utf-8") as file:
        for url in list_good_urls:
            file.write(url + '\n')
    print("urls saved !")

The first function finds every URL in a string. The second connects to the inbox over IMAP, fetches the emails from a specific sender, and saves the URLs that contain the specified string.

To show the issue, I used the website http://ismyemailworking.com/, which sends you an email containing two URLs that include the string "email":

http://ismyemailworking.com/Block.aspx
http://ismyemailworking.com/Contact.aspx

The URLs saved by the code (in fact, only one URL is found):

IsMyEmailWorking.com/Block.aspx=20to=20temporarily=20block==20your=20email=20address=20for=201=20hour.=20This=20solves=20the=20problem==2099%=20of=20the=20time.=20If=20after=20this=20you=20continue=20to=20have==20problems=20please=20contact=20us=20via=20the=20contact=20link=20on=20our==20website=20at=20http://IsMyEmailWorking.com/Contact.aspx

I don't know which part of the code is causing this issue. Any help would be appreciated!


Solution

  • A variant using imap_tools, which returns the message body already decoded:

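    from imap_tools import MailBox, A
    from magic import find_urls  # placeholder: stands for the find_urls function from the question
    
    # log in and open INBOX; the mailbox is closed automatically on exit
    with MailBox('imap.mail.com').login('[email protected]', 'pwd', 'INBOX') as mailbox:
        for msg in mailbox.fetch(A(all=True)):
            # msg.text / msg.html are decoded strings, so no "=20" sequences remain
            body = msg.text or msg.html  # prefer the plain-text part, fall back to HTML
            urls = find_urls(body)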
    

    Regards, author of imap_tools

    https://github.com/ikvk/imap_tools
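
For anyone who prefers to stay with imaplib: the `=20` sequences and the `=` at line ends in the saved output look like quoted-printable transfer encoding (`=20` is a space, a trailing `=` is a soft line break). The question's code decodes the raw message as UTF-8 without undoing that encoding and then strips all whitespace, so the text around both URLs runs together into one match. Below is a minimal sketch using only the standard library's `email` package to decode the body first. The names `extract_body` and `matching_urls`, the simplified URL pattern, the case-insensitive filter, and the sample message are all assumptions made for illustration, not part of the original code.

```python
import email
import re
from email import policy

# Deliberately simple URL pattern for this sketch; the regex from the
# question could be swapped in instead.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def extract_body(raw_email: bytes) -> str:
    """Return the decoded text of a raw RFC 822 message (plain text preferred, then HTML)."""
    msg = email.message_from_bytes(raw_email, policy=policy.default)
    part = msg.get_body(preferencelist=("plain", "html"))
    # get_content() undoes quoted-printable/base64 and applies the declared charset
    return part.get_content() if part is not None else ""


def matching_urls(raw_email: bytes, url_string: str) -> list[str]:
    """Return URLs from the decoded body that contain url_string.

    Assumption: the comparison ignores case, unlike the `in` test in the question.
    """
    needle = url_string.lower()
    return [u for u in URL_RE.findall(extract_body(raw_email)) if needle in u.lower()]


# Hypothetical sample message, modelled on the output shown in the question
sample = (
    b"From: sender@example.com\r\n"
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Visit http://IsMyEmailWorking.com/Block.aspx=20to=20block=20your=\r\n"
    b"=20email=20address. Contact: http://IsMyEmailWorking.com/Contact.aspx\r\n"
)

print(matching_urls(sample, "email"))
```

In the question's loop, `raw_email = data[0][1]` could be passed straight to `matching_urls(raw_email, url_string)` in place of the `decode(...).replace(...)` chain, so whitespace is kept and still separates each URL from the surrounding text.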