python regex beautifulsoup html-content-extraction

Using BeautifulSoup to find an HTML tag that contains certain text


I'm trying to get the elements in an HTML doc that contain the following pattern of text: #\S{11}

<h2> this is cool #12345678901 </h2>

The heading above would be matched by:

soup('h2',text=re.compile(r' #\S{11}'))

And the results would be something like:

[u'blahblah #223409823523', u'thisisinteresting #293845023984']

I'm able to get all the text that matches (see line above). But I want the parent element of each matching text, so I can use it as a starting point for traversing the document tree. In this case, I'd want all the h2 elements returned, not the text matches.

Ideas?


Solution

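    # Written for BeautifulSoup 3 on Python 2 (note the import and the print statement).
    from BeautifulSoup import BeautifulSoup
    import re
    
    html_text = """
    <h2>this is cool #12345678901</h2>
    <h2>this is nothing</h2>
    <h1>foo #126666678901</h1>
    <h2>this is interesting #126666678901</h2>
    <h2>this is blah #124445678901</h2>
    """
    
    soup = BeautifulSoup(html_text)
    
    # Searching by text returns the matching strings, whatever tag they sit in,
    # so step up to each string's parent and keep only the h2 elements.
    for elem in soup(text=re.compile(r' #\S{11}')):
        if elem.parent.name == 'h2':
            print elem.parent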
    

    Prints:

    <h2>this is cool #12345678901</h2>
    <h2>this is interesting #126666678901</h2>
    <h2>this is blah #124445678901</h2>
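
For readers on a current setup, here is a minimal sketch of the same search written for Beautiful Soup 4 on Python 3. That is an assumption: the answer above uses the older BeautifulSoup 3 import. The `bs4` package, the `html.parser` backend, and the names `pattern`, `h2_matches` and `parents` are choices made for this illustration. It shows two routes to the elements: passing the tag name together with `string=`, which should return the `h2` tags themselves, and searching by text alone and then stepping up with `.parent`, which also picks up the `h1` in the sample.

```python
import re

from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

# "html.parser" is the parser bundled with Python, so no extra dependency is needed.
soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r" #\S{11}")

# Tag name plus string=: returns the h2 Tag objects whose .string matches the pattern.
h2_matches = soup.find_all("h2", string=pattern)

# string= alone: returns the matching text nodes from any tag; .parent is the enclosing tag.
parents = [text_node.parent for text_node in soup.find_all(string=pattern)]

for tag in h2_matches:
    print(tag)
```

Each item in `h2_matches` is a regular tag object, so it can serve directly as the starting point for further traversal (for example `tag.find_next_sibling()`). The pattern is applied as a search rather than a full match, so text with more than 11 non-space characters after the `#` also matches.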