python regex web-scraping

How do I exclude a string from re.findall?


This might be a silly question, but I'm just trying to learn!

I'm trying to build a simple email search tool to learn more about Python. I'm modifying some open-source code that parses email addresses out of a page:

    emails = re.findall(r'([A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*)', html)

Then I'm writing the results into a spreadsheet using the csv module.

Since I'd like to keep the domain extension open to almost anything, my results include image filenames that happen to look like email addresses:

example: [email protected]

How can I exclude the "png" matches from the re.findall results?

Code:

    def scrape(self, page):
        try:
            request = urllib2.Request(page.url.encode("utf8"))
            html    = urllib2.urlopen(request).read()
        except Exception:
            # skip pages that cannot be fetched
            return
        emails = re.findall(r'([A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*)', html)
        for email in emails:
            if email not in self.emails:  # if not a duplicate
                self.csvwriter.writerow([page.title.encode('utf8'), page.url.encode("utf8"), email])
                self.emails.append(email)

Solution

  • You are already gating the write on an if check, so make the extension test part of that condition. That will be much easier than trying to exclude it in the regex.

    if email not in self.emails and not email.endswith("png"):  # not a duplicate and not a png match
        self.csvwriter.writerow([page.title.encode('utf8'), page.url.encode("utf8"), email])
        self.emails.append(email)
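
If more than one image type turns up, the same idea extends naturally, since str.endswith also accepts a tuple of suffixes. The following is only a minimal sketch of that filter pulled out into a standalone function so it can be tried without the scraper. The name find_emails, the IMAGE_EXTENSIONS list, and the sample HTML (including the logo@2x.png filename) are made up for illustration and are not part of the original code.

```python
import re

# Same pattern as in the question.
EMAIL_RE = re.compile(r'([A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*)')

# Hypothetical list of extensions to reject; adjust to whatever shows up.
IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg', '.gif', '.svg', '.webp')


def find_emails(html, seen=None):
    """Return new, non-image matches from html in order of first appearance."""
    seen = set() if seen is None else seen
    results = []
    for match in EMAIL_RE.findall(html):
        # endswith accepts a tuple, so one call checks every extension;
        # lower() makes the check case-insensitive (e.g. ".PNG").
        if match.lower().endswith(IMAGE_EXTENSIONS):
            continue
        # Skip anything already recorded, as the original "if" does.
        if match in seen:
            continue
        seen.add(match)
        results.append(match)
    return results


# Made-up sample input for illustration only.
sample = ('Contact jane@example.com or bob@example.org '
          '<img src="logo@2x.png"> jane@example.com')
print(find_emails(sample))  # ['jane@example.com', 'bob@example.org']
```

Passing the same set as seen on every call would play the role of self.emails in scrape, and a set makes the duplicate check faster than a list as the number of addresses grows.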