Using BeautifulSoup, I am aiming to scrape the text associated with this HTML hook:
<p class="review_comment">
So, using the simple code as follows,
from bs4 import BeautifulSoup

content = page.read()
soup = BeautifulSoup(content, "html.parser")  # name the parser explicitly to avoid a warning
results = soup.find_all("p", "review_comment")
I am happily parsing the text that is living here:
<p class="review_comment">
This place is terrible!</p>
The bad news is that, roughly once in every 30 matches, soup.find_all also grabs something I really don't want, namely a user's old review that they've since updated:
<p class="review_comment">
It's 1999, and I will always love this place…
<a href="#" class="show-archived">Read more »</a></p>
In my attempts to exclude these old duplicate reviews, I have tried a hodgepodge of ideas, including changing the soup.find_all() call to specifically exclude any text that comes before the <a href="#" class="show-archived">Read more »</a> link, that is, any paragraph containing an element with the class="show-archived" attribute.
Any ideas would be gratefully appreciated. Thanks in advance.
Is this what you are seeking?
for p in soup.find_all("p", "review_comment"):
    if p.find(class_='show-archived'):
        continue
    # p is now a wanted p
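For reference, here is a minimal self-contained sketch of the same loop, run against the two sample paragraphs from the question. The `html` string and the helper name `current_reviews` are made up for illustration, and the built-in "html.parser" is an assumption; swap in whichever parser you already use.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question: one current review, one archived review.
html = """
<p class="review_comment">
This place is terrible!</p>
<p class="review_comment">
It's 1999, and I will always love this place…
<a href="#" class="show-archived">Read more »</a></p>
"""

soup = BeautifulSoup(html, "html.parser")


def current_reviews(soup):
    """Return the text of review paragraphs that contain no 'show-archived' element."""
    texts = []
    for p in soup.find_all("p", "review_comment"):
        # An archived review has a descendant with class="show-archived"; skip it.
        if p.find(class_="show-archived"):
            continue
        texts.append(p.get_text(strip=True))
    return texts


reviews = current_reviews(soup)
print(reviews)
```

The check uses p.find() rather than a regex on the text, so it depends only on the presence of the "Read more" link inside the paragraph and not on what the review says.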