
Get text from specific blocks excluding some nested tags


I am trying to write a Python script that extracts the text from a specific element but excludes the text inside some of its nested tags.

This is the HTML I'm trying to scrape:

<div class="article_body">
    <div id="articleBodyContents">
        Stack Overflow
        <br/>
        Is love
        <br/>
        <a href="https://example_site1.com" target="_blank">Ad</a>
        <br/>
        <a href="https://example_site2.com" target="_blank">Ad2</a>
    </div>
</div>

Here is what I have so far:

from bs4 import BeautifulSoup

# html holds the markup shown above
soup = BeautifulSoup(html, "html.parser")
divs = soup.find_all('div', {'id': 'articleBodyContents'})
for ops in divs:
    print(ops.text.replace('\n', '').strip())

However, this also prints the text of the links:

Stack Overflow
Is love
Ad
Ad2

What I want is only:

Stack Overflow
Is love

Solution

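  • You are nearly there. NavigableString is what you need here. Select the parent element, iterate over the direct children of its inner div, and keep only the children that are instances of NavigableString. Those are the bare text nodes; the link text lives inside Tag children (the a elements), so it is left out. Here is your code: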

    from bs4 import BeautifulSoup, NavigableString
    
    html = """
    <div class="article_body">
        <div id="articleBodyContents">
            Stack Overflow
            <br/>
            Is love
            <br/>
            <a href="https://example_site1.com" target="_blank">Ad</a>
            <br/>
            <a href="https://example_site2.com" target="_blank">Ad2</a>
        </div>
    </div>
    """
    
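    soup = BeautifulSoup(html, "html.parser")
    # Locate the outer container; .div then gives its first nested <div>.
    container = soup.find('div', {'class': 'article_body'})
    # Keep only bare text nodes among the direct children, dropping the
    # whitespace-only strings that sit between the <br/> and <a> tags.
    ops = [element.strip() for element in container.div
           if isinstance(element, NavigableString) and element.strip()]
    for op in ops:
        print(op)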
    

    Output:

    Stack Overflow
    Is love