
Removing tags from HTML, except specific ones (but keep their contents)


I use the code below to strip every tag from an HTML string while keeping <br> and <br/>:

import re
MyString = 'aaa<p>Radio and<BR> television.<br></p><p>very<br/> popular in the world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'
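# Group 1 captures <br> / <br/>; any other tag matches the second branch and is replaced by an empty \1
MyString = re.sub(r'(?i)(<br/?>)|<[^>]*>', r'\1', MyString)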
print(MyString)

The output is:

aaaRadio and<BR> television.<br>very<br/> popular in the world today.Millions of people watch TV. That’s because a radio is very small 98.2%and it‘s easy to carry. haha100%bb

The result is correct, but now I also want to keep <p> and </p>, in addition to <br> and <br/>.

How can I modify my code?


Solution

  • Using an HTML parser is much more robust than using regex. Regex should not be used to parse nested structures like HTML.

    Here's a working implementation that iterates over every tag in the document and, for any tag other than p or br, removes the tag itself while keeping its contents:

    from bs4 import BeautifulSoup
    
    mystring = 'aaa<p>Radio and<BR> television.<br></p><p>very<br/> popular in the world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'
    
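    soup = BeautifulSoup(mystring, 'html.parser')
    # find_all() with no arguments returns every tag in the document
    for e in soup.find_all():
        if e.name not in ['p', 'br']:
            # unwrap() removes the tag but leaves its children in place
            e.unwrap()
    print(soup)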
    

    Output:

    aaa<p>Radio and<br/> television.<br/></p><p>very<br> popular in the world today.</br></p><p>Millions of people watch TV. </p><p>That’s because a radio is very small 98.2%</p><p>and it‘s easy to carry. haha100%</p>bb
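
If the input is as flat as the sample and adding a dependency is not an option, the pattern from the question can also be widened to a whitelist. The following is only a minimal sketch under that assumption: the helper name `strip_tags_except` and its `keep` parameter are made up for illustration, and the approach inherits the usual regex limitations (comments, `>` inside attribute values, script contents), so the parser-based answer remains the safer choice for real-world HTML.

```python
import re

def strip_tags_except(html, keep=("p", "br")):
    """Remove every tag except those named in `keep` (hypothetical helper)."""
    names = "|".join(re.escape(name) for name in keep)
    # Group 1 matches an opening, closing, or self-closing whitelisted tag.
    # The lookahead requires the tag name to end there, so <pre> is not
    # mistaken for <p>. Any other tag falls through to the second branch.
    pattern = re.compile(
        rf"(</?(?:{names})(?=[\s/>])[^>]*>)|<[^>]*>",
        re.IGNORECASE,
    )
    # Keep whitelisted tags as written; replace everything else with nothing.
    return pattern.sub(lambda m: m.group(1) or "", html)

sample = 'aaa<p>Radio and<BR> television.<br></p><p>very<br/> popular <span>98.2%</span></p>bb'
print(strip_tags_except(sample))
# aaa<p>Radio and<BR> television.<br></p><p>very<br/> popular 98.2%</p>bb
```

Unlike the BeautifulSoup version, this leaves the kept tags exactly as they appeared in the input (including case and any attributes) instead of normalizing them.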