BeautifulSoup extract base text


I have a div that looks something like this:

<div>
    " Base Text "
    <span> 
        " Inner Text "
    </span>
    " Outer Base Text "
</div>

I want to extract only the text that is not inside the div's children (the immediate text). In this example, the immediate text is " Base Text " and " Outer Base Text ".

Is there a direct way (such as a BeautifulSoup function) to get only the div's own text and ignore the contents of its children?


Solution

  • Correction: there is a direct way; see the comment from Barry above. Indirectly, you can iterate over every string inside the tag and use a list comprehension to keep only those whose parent is the tag itself:

    from bs4 import BeautifulSoup
    
    html_content = '''
    <div>
        Base Text
        <span> 
            Inner Text
        </span>
        Outer Base Text
    </div>
    '''
    
    soup = BeautifulSoup(html_content, 'html.parser')
    
    div = soup.find('div')
    
    # Keep only the strings whose parent is the div itself, skipping text inside child tags
    text = ''.join(str(s) for s in div.strings if s.parent is div)
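The direct way mentioned in the answer is not spelled out here. One possibility, offered as an assumption rather than as what the referenced comment actually said, is `find_all` with `string=True` and `recursive=False`, which restricts the search to text nodes that are direct children of the tag. A minimal sketch using the same sample HTML (the name `immediate_text` is only illustrative):

```python
from bs4 import BeautifulSoup

html_content = """
<div>
    Base Text
    <span>
        Inner Text
    </span>
    Outer Base Text
</div>
"""

soup = BeautifulSoup(html_content, "html.parser")
div = soup.find("div")

# string=True matches text nodes instead of tags;
# recursive=False limits the search to direct children of the div.
direct_strings = div.find_all(string=True, recursive=False)

# Strip surrounding whitespace and drop whitespace-only nodes.
immediate_text = [s.strip() for s in direct_strings if s.strip()]
print(immediate_text)  # ['Base Text', 'Outer Base Text']
```

Unlike the joined string in the list-comprehension approach, this keeps the two pieces of text as separate items and removes the indentation whitespace around them; join them with `' '.join(immediate_text)` if a single string is needed.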