python, screen-scraping, enumerate

re.compile only takes two arguments; is there a way to make it take more, or another way around that?


I can access an email saved as a txt file on my computer, and my goal now is to scrape specific data out of it. I have used re.compile and enumerate to go through the email line by line looking for matching words (in my case, fish species such as GOM Cod) and then print them. But there are hundreds more emails I will need to parse, each with several different fish species listed in them. So my question is: what is the best way to go about this? I can't put all 17 possible fish species into one re.compile call, so should I just have 17 blocks of the same code with only the fish species changed in each? Is that the most efficient way? My code is below.

import os
import email
import re

path = 'Z:\\folderwithemail'

for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        with open(file_path, 'r') as f:
            sector_result = []
            pattern = re.compile("GOM Cod", re.IGNORECASE)
            for linenum, line in enumerate(f):
                if pattern.search(line) is not None:
                    sector_result.append((linenum, line.rstrip('\n')))
            # Print once per file, after all lines have been scanned
            for linenum, line in sector_result:
                print("Fish Species:", line)

Solution

  • You can alternate between fish species using the vertical bar |:

    A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B

    pattern = re.compile(r"GOM Cod|Salmon|Tuna", re.IGNORECASE)
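    With 17 species, typing the alternation by hand gets error-prone, so one option is to build it from a list. The following is only a sketch: the names in SPECIES (other than GOM Cod) and the helper functions build_species_pattern and find_species are made up for illustration, and the sample lines stand in for a real email file.

```python
import re

# Hypothetical species list; only "GOM Cod" comes from the question.
SPECIES = ["GOM Cod", "Salmon", "Tuna"]


def build_species_pattern(species):
    # Escape each name so any regex metacharacters are matched literally,
    # then join the names with "|" so the pattern matches any one of them.
    alternation = "|".join(re.escape(name) for name in species)
    return re.compile(alternation, re.IGNORECASE)


def find_species(lines, pattern):
    # Collect (line number, matched text, full line) for every match found.
    results = []
    for linenum, line in enumerate(lines):
        for match in pattern.finditer(line):
            results.append((linenum, match.group(0), line.rstrip("\n")))
    return results


pattern = build_species_pattern(SPECIES)

# Stand-in for the lines of one email file (assumed content).
sample = ["Allocation: gom cod 100 lbs\n", "No fish here\n", "Tuna and Salmon\n"]

for linenum, name, line in find_species(sample, pattern):
    print("Fish Species:", name, "| line", linenum, "|", line)
```

    Because match.group(0) holds the text that actually matched, a single pass over each file reports which species was found, with no need for 17 copies of the loop. If one species name is a prefix of another, the leftmost alternative wins, so it may help to list the longer names first.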