
imap_tools Taking Long Time to Scrape Links from Emails


I am using imap_tools to get links from emails. The emails are small, with very little text or graphics, and there are not many of them: around 20-40 spread through the day.

When a new email arrives, it takes between 10 and 25 seconds to scrape the link. This seems very long. I would have expected less than 2 seconds, and speed is important.

N.B. It is a shared mailbox, and I cannot simply fetch unseen emails because other users will often have opened them before the scraper gets to them.

Can anyone see what the issue is?

import pandas as pd
from imap_tools import MailBox, AND
import re, time, datetime, os
from config import email, password

uids = []
yahooSmtpServer = "imap.mail.yahoo.com"
data = {
    'today': str(datetime.datetime.today()).split(' ')[0],
    'uids': []
    }
while True:
    while True:
        try:
            client = MailBox(yahooSmtpServer).login(email, password, 'INBOX')
            try:
                if not data['today'] == str(datetime.datetime.today()).split(' ')[0]:
                    data['today'] = str(datetime.datetime.today()).split(' ')[0]
                    data['uids'] = []
                ds = str(datetime.datetime.today()).split(' ')[0].split('-')
                msgs = client.fetch(AND(date_gte=datetime.date.today()))
                for msg in msgs:
                    links = []
                    if str(datetime.datetime.today()).split(' ')[0] == str(msg.date).split(' ')[0] and not msg.uid in data['uids']:
                        mail = msg.html
                        if 'order' in mail and not 'cancel' in mail:
                            for i in re.findall(r'(https?://[^\s]+)', mail):
                                if 'pick' in i:
                                    link = i.replace('"', "")
                                    link = link.replace('<', '>').split('>')[0]
                                    print(link)
                                    links.append(link)
                                    break
                        data['uids'].append(msg.uid)
                        scr_links = pd.DataFrame({'Links': links})
                        scr_links.to_csv('Links.csv', mode='a', header=False, index=False)
                        time.sleep(0.5)
            except Exception as e:
                print(e)
                pass
            client.logout()
            time.sleep(5)
        except Exception as e:
            print(e)
            print('sleeping for 1 sec')
            time.sleep(1)

Solution

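  • I think this is caused by throttling on the email server.

    Take a look at IMAP IDLE, which lets the server notify the client when the mailbox changes instead of the client polling for new mail.

    imap_tools has supported IDLE since version 0.51.0: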

    https://github.com/ikvk/imap_tools/releases/tag/v0.51.0
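
To make the IDLE suggestion concrete, here is a minimal sketch of how the polling loop from the question might look with one long-lived connection and `mailbox.idle.wait()`. It assumes imap_tools 0.51.0 or later and a server that supports IDLE. The names `extract_pick_link`, `watch_inbox` and `idle_timeout` are made up for this example, and reconnecting after a dropped connection is left out. The link extraction follows the same steps as the code in the question.

```python
import datetime
import re


def extract_pick_link(html):
    """Return the first 'pick' link from an order email, or None.

    Same rules as the question's code: the mail must mention 'order',
    must not mention 'cancel', and the link must contain 'pick'.
    """
    if 'order' not in html or 'cancel' in html:
        return None
    for raw in re.findall(r'https?://[^\s]+', html):
        if 'pick' in raw:
            # Drop quotes, then cut the match at the first '<' or '>'.
            return raw.replace('"', '').replace('<', '>').split('>')[0]
    return None


def watch_inbox(host, user, password, idle_timeout=60):
    """Hypothetical helper: log in once, then wait for changes with IDLE."""
    # Imported here so extract_pick_link can be used without imap_tools installed.
    from imap_tools import MailBox, AND

    seen_uids = set()  # UIDs already handled, since SEEN flags are unreliable here
    with MailBox(host).login(user, password, 'INBOX') as mailbox:
        while True:
            # mark_seen=False leaves the flags alone for the other mailbox users.
            criteria = AND(date_gte=datetime.date.today())
            for msg in mailbox.fetch(criteria, mark_seen=False):
                if msg.uid in seen_uids:
                    continue
                seen_uids.add(msg.uid)
                link = extract_pick_link(msg.html)
                if link:
                    print(link)
            # Block until the server reports a change or the timeout expires,
            # then loop around and fetch again.
            mailbox.idle.wait(timeout=idle_timeout)
```

It could be started with something like `watch_inbox("imap.mail.yahoo.com", email, password)`. Compared with the original loop, this logs in only once instead of every few seconds, which may also help if the delay really does come from server-side throttling.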