Tags: python, parsing, text, beautifulsoup

Only extracting text from this element, not its children


I want to extract only the text from the top-most element of my soup; however soup.text gives the text of all the child elements as well:

I have

import BeautifulSoup
soup=BeautifulSoup.BeautifulSoup('<html>yes<b>no</b></html>')
print soup.text

The output of this is yesno. I want just 'yes'.

What's the best way of achieving this?

Edit: I also want yes to be output when parsing '<html><b>no</b>yes</html>'.


Solution

  • In modern (as of 2023-06-17) BeautifulSoup4, given:

    from bs4 import BeautifulSoup
    node = BeautifulSoup("""
    <html>
        <div>
            <span>A</span>
            B
            <span>C</span>
            D
        </div>
    </html>""", "html.parser").find('div')
    

    Use the following to get only the text nodes that are direct children of the node (B and D, together with the whitespace around them):

    s = "".join(node.find_all(string=True, recursive=False))
    

    And the following to get the text nodes of all descendants (A, B, C and D, again with the surrounding whitespace):

    s = "".join(node.find_all(string=True, recursive=True))
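
    Applied to the two snippets from the question, the same call can be wrapped in a small helper. This is only a sketch: direct_text is a hypothetical name, not part of BeautifulSoup, and the example assumes the built-in "html.parser" (other parsers such as lxml may wrap the text in extra body/p tags, which would change what counts as a direct child).

```python
from bs4 import BeautifulSoup


def direct_text(node):
    """Return the text held directly by node, ignoring text inside child tags.

    Hypothetical helper for illustration; it joins the node's own text
    nodes and strips leading/trailing whitespace from the result.
    """
    return "".join(node.find_all(string=True, recursive=False)).strip()


# Both orderings from the question: text before and after the child tag.
for markup in ("<html>yes<b>no</b></html>", "<html><b>no</b>yes</html>"):
    root = BeautifulSoup(markup, "html.parser").html
    print(direct_text(root))
```

    With "html.parser" both iterations should print yes, since the text inside the b tag is not a direct child of html.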