Tags: python, html, python-3.x, web-scraping, beautifulsoup

How to get text in BeautifulSoup like .innerText, not .textContent, in JS


I have an HTML file that contains text inside a p tag, something like this:

<body>
    <p>Lorem ipsum dolor sit amet, 
        consectetur adipiscing elit. 
        Maecenas sed mi lacus. 
        Vivamus luctus vehicula lacus, 
        ut malesuada justo posuere et. 
        Donec ut diam volutpat</p>
</body>

Using Python and BeautifulSoup I tried to get to the text in the p tag, like:

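from bs4 import BeautifulSoup

# Parse the file; the 'lxml' parser requires the lxml package to be installed
with open("foo.html", 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), 'lxml')
p = soup.p
print(p.text)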

and the result: 'Lorem ipsum dolor sit amet, \n\t\tconsectetur adipiscing elit. \n\t\tMaecenas sed mi lacus. \n\t\tVivamus luctus vehicula lacus, \n\t\tut malesuada justo posuere et. \n\t\tDonec ut diam volutpat'

The problem is that the result still contains the \n and \t characters from the original file (like .textContent in JS). I need something similar to .innerText in JS, which returns the text as the user sees it in the browser.

I tried p.text.replace("\n", " ").replace("\t", ""), but for more complicated cases, such as a tag nested within a tag, it doesn't work well (it leaves unnecessary spaces).

Does anyone have an idea how to do this? Thanks in advance!


Solution

  • If I understand you correctly, you can use a regular expression to clean up the text. Consider this example:

    from bs4 import BeautifulSoup
    
    html_text = """\
    <body>
        <p>Lorem ipsum dolor sit amet,
            consectetur adipiscing elit.
            Maecenas sed mi lacus.
                <span>This is inner span.</span>
            Vivamus luctus vehicula lacus,
            ut malesuada justo posuere et.
            Donec ut diam volutpat</p>
    </body>"""
    
    soup = BeautifulSoup(html_text, "html.parser")
    print(soup.p.text)
    

    Prints:

    Lorem ipsum dolor sit amet,
            consectetur adipiscing elit.
            Maecenas sed mi lacus.
                This is inner span.
            Vivamus luctus vehicula lacus,
            ut malesuada justo posuere et.
            Donec ut diam volutpat
    

    You can then do:

    import re
    
    # Collapse every run of whitespace (including a lone newline or tab) into one space
    print(re.sub(r"\s+", " ", soup.p.text).strip())
    

    This prints:

    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas sed mi lacus. This is inner span. Vivamus luctus vehicula lacus, ut malesuada justo posuere et. Donec ut diam volutpat
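
If you would rather not use a regular expression, the same whitespace cleanup can be sketched with plain string methods. The helper name `visible_text` and the short sample markup below are made up for illustration. This only approximates how a browser collapses whitespace in inline text; it is not a full equivalent of .innerText.

```python
from bs4 import BeautifulSoup


def visible_text(tag):
    """Return the tag's text with every whitespace run collapsed to one space.

    Hypothetical helper for illustration, not part of BeautifulSoup.
    """
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and discards leading/trailing whitespace,
    # so joining the pieces with a single space normalizes the text.
    return " ".join(tag.get_text().split())


# Sample markup (an assumption for this sketch): nested tag, indentation,
# and a single newline with no indentation after it.
html_text = """\
<p>Lorem ipsum,
    dolor <b>sit</b>
amet.</p>"""

soup = BeautifulSoup(html_text, "html.parser")
print(visible_text(soup.p))
```

Note that this treats all whitespace the same way. Unlike .innerText in a browser, it does not turn `<br>` tags or block-level elements into line breaks, so markup that relies on those would need extra handling.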