
How can I remove all tags from a BeautifulSoup object except specific ones like <strong> or <em>?


Given the following HTML, how can I remove all tags except stylistic ones such as <strong> or <em> using BeautifulSoup?

    <ol class="journal">
    <li>A. Gilad Kusne, Heshan Yu, Changming Wu, Huairuo Zhang, Jason Hattrick-Simpers, Brian
    DeCost, Suchismita Sarker, Corey Oses, Cormac Toher, Stefano Curtarolo, Albert V. Davydov,
    Ritesh Agarwal, Leonid A. Bendersky, Mo Li, Apurva Mehta, Ichiro Takeuchi. <strong>On-the-fly
    closed-loop materials discovery via Bayesian active learning</strong>. <em>Nature Communications</em>, 2020; 11 (1) DOI: <a href="http://dx.doi.org/10.1038/s41467-020-19597-w" rel="nofollow" target="_blank">10.1038/s41467-020-19597-w</a>
    </li>
    </ol>

I know I could use a regex to remove specific tags, but is there an elegant way to remove some tags while keeping others in BeautifulSoup?


Solution

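  • One approach is to let BeautifulSoup collect the names of the tags you want to strip, then delete those tags from the raw string with a regex: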

    import re
    from bs4 import BeautifulSoup as bs
    
    html = """<ol class="journal">
        <li>A. Gilad Kusne, Heshan Yu, Changming Wu, Huairuo Zhang, Jason 
    Hattrick-Simpers, Brian DeCost, Suchismita Sarker, Corey Oses, Cormac Toher, 
    Stefano Curtarolo, Albert V. Davydov, Ritesh Agarwal, Leonid A. Bendersky, 
    Mo Li, Apurva Mehta, Ichiro Takeuchi. <strong>On-the-fly closed-loop 
    materials discovery via Bayesian active learning</strong>. 
    <em>Nature Communications</em>, 2020; 11 (1) DOI: 
    <a href="http://dx.doi.org/10.1038/s41467-020-19597-w" rel="nofollow" 
    target="_blank">10.1038/s41467-020-19597-w</a>
        </li>
        </ol>"""
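    # Parse as HTML; html.parser ships with Python, so no extra dependency is needed.
    soup = bs(html, features='html.parser')
    # Collect the distinct names of every tag that should be stripped.
    tags = {tag.name for tag in soup.find_all(True) if tag.name not in ('strong', 'em')}
    for tag in tags:
        # \b keeps a short name such as "a" from also matching longer names like <abbr>.
        html = re.sub(rf'</?{re.escape(tag)}\b[^>]*>', '', html)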
    print(html)
    

    The output (leading whitespace and line breaks tidied for display):

    A. Gilad Kusne, Heshan Yu, Changming Wu, Huairuo Zhang, Jason Hattrick-Simpers, 
    Brian DeCost, Suchismita Sarker, Corey Oses, Cormac Toher, Stefano Curtarolo, 
    Albert V. Davydov, Ritesh Agarwal, Leonid A. Bendersky, Mo Li, Apurva Mehta, 
    Ichiro Takeuchi. <strong>On-the-fly closed-loop materials discovery 
    via Bayesian active learning</strong>. <em>Nature Communications</em>, 
    2020; 11 (1) DOI: 10.1038/s41467-020-19597-w
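
If you would rather avoid the regex entirely, BeautifulSoup's own `Tag.unwrap()` may be a better fit: it replaces a tag with its contents, so the text and any nested tags survive. The following is only a minimal sketch, not the answer's original method. It assumes the built-in `html.parser` is acceptable, and the names `strip_tags_except` and `KEEP` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical allow-list: the tags that should stay in the output.
KEEP = {"strong", "em"}


def strip_tags_except(html, keep=KEEP):
    """Return html with every tag unwrapped except those named in keep."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) returns a list of every tag, so the tree can be
    # modified safely while looping over it.
    for tag in soup.find_all(True):
        if tag.name not in keep:
            # unwrap() removes the tag itself but leaves its children in place.
            tag.unwrap()
    return str(soup)


sample = '<li>Title: <strong>On-the-fly</strong> in <em>Nature</em>, <a href="#">link</a></li>'
print(strip_tags_except(sample))
# Title: <strong>On-the-fly</strong> in <em>Nature</em>, link
```

Because the tree is edited rather than the raw string, attributes containing `>` or tag names that share a prefix cannot confuse the match. Use `decompose()` instead of `unwrap()` only if you also want to discard the text inside the removed tags.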