python html web-scraping html-parsing beautifulsoup

Beautiful Soup 4: Remove comment tag and its content


The page I'm scraping contains the HTML below. How do I remove the comment <!-- --> along with its content using bs4?

<div class="foo">
cat dog sheep goat
<!-- 
<p>NewPP limit report
Preprocessor node count: 478/300000
Post‐expand include size: 4852/2097152 bytes
Template argument size: 870/2097152 bytes
Expensive parser function count: 2/100
ExtLoops count: 6/100
</p>
-->
</div>

Solution

  • You can use extract() (solution is based on this answer):

    PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.

    from bs4 import BeautifulSoup, Comment
    
    data = """<div class="foo">
    cat dog sheep goat
    <!--
    <p>test</p>
    -->
    </div>"""
    
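    # Name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(data, 'html.parser')
    
    div = soup.find('div', class_='foo')
    # Comment is a NavigableString subclass, so match strings by type
    for element in div.find_all(string=lambda text: isinstance(text, Comment)):
        element.extract()
    
    print(soup.prettify())
    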
    

    As a result, you get the div without the comment:

    <div class="foo">
        cat dog sheep goat
    </div>
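
If the comments are not confined to a single div, the same idea can be applied to the whole document. The following is a minimal sketch, assuming bs4 4.4.0 or later (for the `string=` argument) and the built-in `html.parser`. The helper name `strip_comments` is hypothetical and not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, Comment


def strip_comments(html):
    """Return html with every comment node (and its content) removed.

    strip_comments is an illustrative helper, not a Beautiful Soup API.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all() builds the full result list before the loop starts,
    # so extracting nodes while iterating is safe.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    return str(soup)


if __name__ == "__main__":
    page = '<div class="foo">cat dog<!-- <p>hidden</p> --></div>'
    print(strip_comments(page))
```

Since extract() returns the node it removed, the loop could also collect the comments in a list if their text is needed later.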