python html beautifulsoup html-parsing

How to dynamically find the nearest specific parent of a selected element?


I want to parse many HTML pages and remove a div that contains the text "Message", using Python and BeautifulSoup with html.parser. The div has no name or id, so I cannot select it directly. I am able to do this for one HTML page. In the code below you will see six .parent accesses. This is because, in this page, there are 5 tags (p, i, b, span, a) between the div tag and the text "Message", and the 6th tag is the div itself.

soup = BeautifulSoup(html_page,"html.parser")
scores = soup.find_all(text=re.compile('Message'))
divs = [score.parent.parent.parent.parent.parent.parent for score in scores]
divs.decompose()

The problem is that the nesting depth varies: the number of tags between the div and "Message" is not the same on every page. On some HTML pages it is 3, and on others 7.

So, is there a way to dynamically find the number of tags (n) between the text "Message" and the nearest enclosing div, and apply n+1 .parent accesses to score (in the code above), using Python and BeautifulSoup?


Solution

  • Assuming, as described in your question, that there is no other <div> in between, you can use .find_parent():

    soup.find(text=re.compile('Message')).find_parent('div').decompose()
    

    Be aware that if you use find_all(), you have to iterate over the ResultSet and call .find_parent() on each result:

    for r in soup.find_all(text=re.compile('Message')):
        r.find_parent('div').decompose()
    

    The same applies to divs.decompose() in your example: divs is a list, so you have to iterate over it and call decompose() on each element.

    Example

    from bs4 import BeautifulSoup
    import re
    html='''
    <div>
        <span>
            <i>
                <x>Message</x>
            </i>
        </span>
    </div>
    '''
    soup = BeautifulSoup(html, 'html.parser')

    # Locate the nearest enclosing <div> of the matching text
    print(soup.find(text=re.compile('Message')).find_parent('div'))
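To tie this back to the question, here is a minimal sketch covering both parts: counting the .parent steps dynamically, and removing every matching div across a page. The helper names steps_to_nearest and remove_divs_containing are made up for illustration and are not part of BeautifulSoup. The sketch uses string= instead of text=, which assumes Beautiful Soup 4.4.0 or newer. It also collects all target divs before removing anything and only decomposes the outermost ones, on the assumption that a page may have several matches inside the same div or in nested divs.

```python
import re
from bs4 import BeautifulSoup


def steps_to_nearest(node, tag_name="div"):
    """Count the .parent hops from node up to its nearest enclosing tag_name.

    Returns None when node has no such ancestor.
    """
    for steps, parent in enumerate(node.parents, start=1):
        if parent.name == tag_name:
            return steps
    return None


def remove_divs_containing(html, pattern):
    """Return html without the nearest enclosing <div> of each text match."""
    soup = BeautifulSoup(html, "html.parser")

    # Collect the target divs first, so the tree is not modified mid-search.
    # Identity (id) is used because Tag equality compares content, not identity.
    targets = {}
    for node in soup.find_all(string=re.compile(pattern)):
        div = node.find_parent("div")
        # find_parent() returns None when the text has no enclosing <div>.
        if div is not None:
            targets[id(div)] = div

    # Keep only divs that are not nested inside another target div;
    # removing the outer div removes the inner one as well.
    outermost = [
        div for div in targets.values()
        if not any(id(parent) in targets for parent in div.parents)
    ]
    for div in outermost:
        div.decompose()
    return str(soup)


if __name__ == "__main__":
    page = "<div><span><i><x>Message</x></i></span></div><p>keep</p>"
    text_node = BeautifulSoup(page, "html.parser").find(string=re.compile("Message"))
    print(steps_to_nearest(text_node))             # 4
    print(remove_divs_containing(page, "Message"))  # <p>keep</p>
```

In practice you rarely need the count itself, since find_parent('div') already walks up for you; steps_to_nearest is mainly useful for checking how deep the nesting is on a given page.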