python web-scraping beautifulsoup scraper frontpage

BeautifulSoup: Strip specified attributes, but preserve the tag and its contents


I'm trying to 'defrontpagify' the HTML of a website generated by MS FrontPage, and I'm writing a BeautifulSoup script to do it.

However, I've gotten stuck on the part where I try to strip a particular attribute (or list of attributes) from every tag in the document that contains it. The code snippet:

REMOVE_ATTRIBUTES = ['lang','language','onmouseover','onmouseout','script','style','font',
                        'dir','face','size','color','style','class','width','height','hspace',
                        'border','valign','align','background','bgcolor','text','link','vlink',
                        'alink','cellpadding','cellspacing']

# remove all attributes in REMOVE_ATTRIBUTES from all tags, 
# but preserve the tag and its content. 
for attribute in REMOVE_ATTRIBUTES:
    for tag in soup.findAll(attribute=True):
        del(tag[attribute])

It runs without error, but doesn't actually strip any of the attributes. When I run it without the outer loop, hard-coding a single attribute (soup.findAll(style=True)), it works.

Does anyone see the problem here?

PS - I don't much like the nested loops either. If anyone knows a more functional, map/filter-ish style, I'd love to see it.


Solution

  • The line

    for tag in soup.findAll(attribute=True):
    

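    does not find any tags, because attribute=True is a keyword argument literally named attribute: it searches for tags that have an attribute called "attribute", not the one named by the loop variable. There might be a way to use findAll, I'm not sure.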

    However, this works (as of beautifulsoup 4.8.1):

    import bs4
    REMOVE_ATTRIBUTES = [
        'lang','language','onmouseover','onmouseout','script','style','font',
        'dir','face','size','color','style','class','width','height','hspace',
        'border','valign','align','background','bgcolor','text','link','vlink',
        'alink','cellpadding','cellspacing']
    
    doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onmouseout="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
    # Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
    soup = bs4.BeautifulSoup(doc, 'html.parser')
    for tag in soup.descendants:
        if isinstance(tag, bs4.element.Tag):
            tag.attrs = {key: value for key, value in tag.attrs.items()
                         if key not in REMOVE_ATTRIBUTES}
    print(soup.prettify())
    

    This is the earlier code, which may have worked with an older version of BeautifulSoup (the BeautifulSoup 3 package):

    import BeautifulSoup
    REMOVE_ATTRIBUTES = [
        'lang','language','onmouseover','onmouseout','script','style','font',
        'dir','face','size','color','style','class','width','height','hspace',
        'border','valign','align','background','bgcolor','text','link','vlink',
        'alink','cellpadding','cellspacing']
    
    doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onmouseout="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
    soup = BeautifulSoup.BeautifulSoup(doc)
    for tag in soup.recursiveChildGenerator():
        try:
            tag.attrs = [(key,value) for key,value in tag.attrs
                         if key not in REMOVE_ATTRIBUTES]
        except AttributeError: 
            # 'NavigableString' object has no attribute 'attrs'
            pass
    print(soup.prettify())
    

    Note that this code will only work in Python 3. If you need it to work in Python 2, see Nóra's answer below.
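
The nested loop from the question can probably be kept if the attribute name is passed through the attrs dictionary instead of as a keyword, so that the value of the loop variable is used as the name. A minimal sketch, assuming bs4 (where findAll is spelled find_all) and the built-in html.parser; the strip_attributes helper, the shortened attribute list, and the sample markup are made up for illustration:

```python
from bs4 import BeautifulSoup

# Shortened list for illustration; the full REMOVE_ATTRIBUTES list works the same way.
REMOVE_ATTRIBUTES = ['align', 'onmouseout', 'bgcolor']


def strip_attributes(soup, attributes):
    """Hypothetical helper: delete each listed attribute wherever it occurs."""
    for attribute in attributes:
        # attrs={attribute: True} matches tags carrying an attribute whose name
        # is the *value* of the loop variable; find_all(attribute=True) would
        # instead look for an attribute literally called "attribute".
        for tag in soup.find_all(attrs={attribute: True}):
            del tag[attribute]
    return soup


# Made-up sample markup, not taken from a real FrontPage site.
doc = ('<html><body bgcolor="#fff"><p id="firstpara" align="center">'
       'This is <a onmouseout="hide()">one</a>.</p></body></html>')
soup = strip_attributes(BeautifulSoup(doc, 'html.parser'), REMOVE_ATTRIBUTES)
print(soup.prettify())
```

The tags and their contents are left in place; only the listed attributes are removed, and attributes not in the list (such as id) survive.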