python eclipse web-scraping beautifulsoup

BeautifulSoup webscrape, isolate specific tag with random html class


New to web scraping here. I've managed to scrape a website successfully, but I've run into one problem. Within each article class there is usually only one 'p' tag, but occasionally an article class contains two or three 'p' tags, with the extra ones holding irrelevant text. The tag I want always looks like this:

<p onclick="window.location.href = 'https://www.blahblah.com/somenumbers'">
some blah blah text
</p>

whereas the other randomly appearing 'p' tags only appear as

<p> irrelevant text </p>

The problem is that I don't know how to grab only the 'p onclick' tag: the website is always the same, but the 'somenumbers' part changes every time. I only need the blah blah text inside the 'p onclick' tag. At the moment I'm scraping the text from all the p tags, so most of the time I get the required text, but whenever the extra p tags appear I also scrape their irrelevant text. They also appear in random order, so using 'content' doesn't work.

I've tried various combinations of soup.findAll, but what stumps me is those changing website numbers. Can anyone please offer a solution?

Thanks in advance.

Vic


Solution

  • You can tell find_all that the tag must have a non-empty onclick attribute by passing a regular expression as the attribute filter; the docs give examples.

    For your case:

    >>> from bs4 import BeautifulSoup
    >>> import re
    >>> 
    >>> soup = BeautifulSoup('<p> blabla</p> and <p onclick="js action">blabla</p>', 'html.parser')
    >>> soup.find_all('p', onclick=re.compile('.'))
    [<p onclick="js action">blabla</p>]
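
To get only the text, and to tie the match to the fixed part of the URL rather than to any onclick value, something like the following may work. It is a minimal sketch: the HTML string is made up to resemble the snippets in the question (the `article` wrapper, the numbers and the second article are assumptions), and `html.parser` is chosen only so the example needs no extra parser installed.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question; the real page will differ.
html = """
<article class="article">
  <p> irrelevant text </p>
  <p onclick="window.location.href = 'https://www.blahblah.com/12345'">
  some blah blah text
  </p>
</article>
<article class="article">
  <p onclick="window.location.href = 'https://www.blahblah.com/67890'">
  more blah blah text
  </p>
  <p> other irrelevant text </p>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

# The regex is searched for anywhere in the attribute value, so only the
# constant part of the URL is spelled out; the changing numbers are ignored.
pattern = re.compile(r"https://www\.blahblah\.com/")
texts = [p.get_text(strip=True) for p in soup.find_all("p", onclick=pattern)]

# Looser alternative: onclick=True accepts any <p> that has the attribute at all.
any_onclick = [p.get_text(strip=True) for p in soup.find_all("p", onclick=True)]

print(texts)
```

Because the filter looks only at the attribute, the order of the p tags inside each article no longer matters, and plain `<p>` tags are skipped.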