Tags: python, beautifulsoup, screen-scraping

Excluding unwanted results of findAll using BeautifulSoup


Using BeautifulSoup, I am aiming to scrape the text associated with this HTML hook:

<p class="review_comment">

So, using the simple code as follows,

from bs4 import BeautifulSoup

content = page.read()
soup = BeautifulSoup(content, "html.parser")  # name the parser explicitly
results = soup.find_all("p", "review_comment")

I am happily parsing the text that is living here:

<p class="review_comment">
    This place is terrible!</p>

The bad news is that roughly once in every 30 matches, soup.find_all also grabs something I really don't want: a user's old review that they have since updated:

<p class="review_comment">
    It's 1999, and I will always love this place…  
<a href="#" class="show-archived">Read more &raquo;</a></p>

In my attempts to exclude these old duplicate reviews, I have tried a hodgepodge of ideas.

  • I've been trying to alter the arguments in my soup.find_all() call to specifically exclude any text that comes before the <a href="#" class="show-archived">Read more &raquo;</a> link.
  • I've drowned in regular-expression matching limbo with no success.
  • I can't seem to take advantage of the class="show-archived" attribute.

Any ideas would be greatly appreciated. Thanks in advance.


Solution

  • Is this what you are looking for? Skip every matched <p> that contains an element with class show-archived:

    for p in soup.find_all("p", "review_comment"):
        if p.find(class_='show-archived'):
            continue
        # p is a current (non-archived) review; process it here
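
    The same filter can be tried end to end with the sketch below. It is only an illustration: SAMPLE_HTML is made-up markup modelled on the two snippets in the question, and current_reviews and current_reviews_css are hypothetical helper names, not part of BeautifulSoup. The second helper assumes a bs4 version (4.7 or later) whose select() is backed by soupsieve, which understands :not() and :has().

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on the two snippets in the question.
SAMPLE_HTML = """
<p class="review_comment">
    This place is terrible!</p>
<p class="review_comment">
    It's 1999, and I will always love this place...
<a href="#" class="show-archived">Read more &raquo;</a></p>
"""


def current_reviews(html):
    """Return the text of each review_comment <p> that has no archived link."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for p in soup.find_all("p", "review_comment"):
        # An archived review contains a descendant with class "show-archived".
        if p.find(class_="show-archived"):
            continue
        texts.append(p.get_text(strip=True))
    return texts


def current_reviews_css(html):
    """Same filter as a CSS selector (assumes bs4 4.7+ with soupsieve)."""
    soup = BeautifulSoup(html, "html.parser")
    selector = "p.review_comment:not(:has(.show-archived))"
    return [p.get_text(strip=True) for p in soup.select(selector)]


print(current_reviews(SAMPLE_HTML))
print(current_reviews_css(SAMPLE_HTML))
```

    The loop version works wherever find_all does. The selector version is shorter, but it depends on the CSS support of the installed bs4, so it is worth checking against your own pages.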