python regex email gmail-imap imaplib

How to scrape a link from a multipart email in Python


I have a program that logs in to a specified Gmail account and fetches all the emails in a selected inbox that were sent from an address you enter at runtime.

I would like to grab all the links from each email and append them to a list, so that I can filter out the ones I don't need before writing them to another file. I was using a regex for this, which requires me to convert the payload to a string. The problem is that the regex I am using doesn't work with findall(); it only works with search() (I am not too familiar with regexes). Is there a better way to extract all links from an email that doesn't involve messing around with regexes?

My code currently looks like this:

print(f'[{Mail.timestamp}] Scanning inbox')
sys.stdout.write(Style.RESET)
self.search_mail_status, self.amount_matching_criteria = self.login_session.search(Mail.CHARSET,search_criteria)

if self.amount_matching_criteria == 0 or self.amount_matching_criteria == '0':
    print(f'[{Mail.timestamp}] No mails from that email address could be found...')
    Mail.enter_to_continue()
    import main
    main.main_wrapper()
else:
    pattern = r'(?P<url>https?://[^\s]+)'  # raw string, so \s reaches the regex engine unchanged
    prog = re.compile(pattern)

    self.amount_matching_criteria = self.amount_matching_criteria[0]
    self.amount_matching_criteria_str = str(self.amount_matching_criteria)
    num_mails = re.search(r"\d.+",self.amount_matching_criteria_str)
    num_mails = ((num_mails.group())[:-1]).split(' ')

    sys.stdout.write(Style.GREEN)
    print(f'[{Mail.timestamp}] Status code of {self.search_mail_status}')
    sys.stdout.write(Style.RESET)
    sys.stdout.write(Style.YELLOW)
    print(f'[{Mail.timestamp}] Found {len(num_mails)} emails')
    sys.stdout.write(Style.RESET)
    num_mails = self.amount_matching_criteria.split()
    for message_num in num_mails:
        individual_response_code, individual_response_data = self.login_session.fetch(message_num, '(RFC822)')
        message = email.message_from_bytes(individual_response_data[0][1])
        if message.is_multipart():
            print('multipart')

            multipart_payload = message.get_payload()
            for sub_message in multipart_payload:
                string_payload = str(sub_message.get_payload())
                print(prog.search(string_payload).group("url"))

Solution

  • I ended up using this for loop with a recursive function and a regex to get the links. I then dropped every link that does not contain the substring you can input earlier in the program, and added the remaining links to a set.

    for message_num in self.amount_matching_criteria.split():
        counter += 1
        _, self.individual_response_data = self.login_session.fetch(message_num, '(RFC822)')
        self.raw = email.message_from_bytes(self.individual_response_data[0][1])
        raw = self.raw
        self.scraped_email_value = email.message_from_bytes(Mail.scrape_email(raw))
        self.scraped_email_value = str(self.scraped_email_value)
        self.returned_links = prog.findall(self.scraped_email_value)
                      
        # Keep only the links that contain the user-supplied substring.
        for i in self.returned_links:
            if self.substring_filter in i:
                self.link_set.add(i)
        self.timestamp = time.strftime('%H:%M:%S')
        print(f'[{self.timestamp}] Links scraped: [{counter}/{len(num_mails)}]')
    

    The function used:

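    def scrape_email(raw):
        # Recurse into the first sub-part until a non-multipart part is reached,
        # then return that part's decoded payload as bytes.
        if raw.is_multipart():
            return Mail.scrape_email(raw.get_payload(0))
        else:
            return raw.get_payload(decode=True)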
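
The recursive function above only follows the first sub-part of each multipart message, so links that appear only in a later part (for example the HTML alternative) would be missed. One possible variant is sketched below. It is not the code from the program above: `extract_links`, `URL_RE`, and the demo message are hypothetical names and data introduced for illustration, and the sketch assumes that every text part declares a charset Python can decode. It uses `Message.walk()` to visit every part, decodes each text part, and collects matches with `findall()`.

```python
import re
from email import message_from_bytes
from email.message import EmailMessage

# Hypothetical pattern: stops at whitespace, quotes and angle brackets so that
# URLs inside HTML attributes are not captured together with the markup.
URL_RE = re.compile(r'https?://[^\s"\'<>]+')


def extract_links(message, substring_filter=""):
    """Return the set of URLs in all text parts that contain substring_filter."""
    links = set()
    for part in message.walk():
        # Skip multipart containers and non-text attachments.
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        text = payload.decode(charset, errors="replace")
        for url in URL_RE.findall(text):
            if substring_filter in url:
                links.add(url)
    return links


# Demo message built locally; it stands in for individual_response_data[0][1].
demo = EmailMessage()
demo.set_content("Plain: https://example.com/a and http://other.test/b")
demo.add_alternative('<a href="https://example.com/c">link</a>', subtype="html")
raw_bytes = demo.as_bytes()

found = extract_links(message_from_bytes(raw_bytes), "example.com")
print(sorted(found))  # ['https://example.com/a', 'https://example.com/c']
```

Because the pattern has no capturing group, `findall()` returns the full matched URLs as strings. Collecting them in a set also removes links that appear in both the plain-text and the HTML part.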