
Remove all inline styles using BeautifulSoup


I'm doing some HTML cleaning with BeautifulSoup, and I'm new to both Python and BeautifulSoup. Based on an answer I found elsewhere on Stack Overflow, I'm already removing tags correctly as follows:

[s.extract() for s in soup('script')]

But how do I remove inline styles? For instance, the following:

<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>
<img class="some_image" href="somewhere.com">

Should become:

<p>Text</p>
<img href="somewhere.com">

How can I delete the class, id, name and style attributes of all elements?

Answers to other similar questions I could find all mentioned using a CSS parser to handle this, rather than BeautifulSoup, but as the task is simply to remove rather than manipulate the attributes, and is a blanket rule for all tags, I was hoping to find a way to do it all within BeautifulSoup.


Solution

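  • You don't need to parse any CSS if you just want to remove it all. BeautifulSoup lets you delete entire attributes from a tag, and calling soup() with no arguments (shorthand for soup.find_all()) gives you every tag in the document: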

    for tag in soup():
        for attribute in ["class", "id", "name", "style"]:
            del tag[attribute]
    

    Also, if you just want to delete entire tags (and their contents), you don't need extract(), which removes the tag from the tree and returns it. decompose() removes the tag and destroys it, which is all you need here:

    for tag in soup("script"):
        tag.decompose()
    

    Not a big difference, but just something else I found while looking at the docs. You can find more details about the API in the BeautifulSoup documentation, with many examples.
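
Putting both pieces together, here is a minimal, self-contained sketch run against the sample markup from the question. The names clean_html and UNWANTED_ATTRS are made up for illustration, the <script> element is added to the sample so there is something to decompose, and the built-in "html.parser" backend is an assumption (the question does not say which parser is in use). It uses tag.attrs.pop(attribute, None) instead of del so that a missing attribute is explicitly a no-op:

```python
from bs4 import BeautifulSoup

# Sample markup from the question, plus a <script> element (added here
# for illustration) so the decompose() step has something to remove.
HTML = """
<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>
<img class="some_image" href="somewhere.com">
<script>alert("hi");</script>
"""

# Hypothetical name: the attributes the question wants stripped everywhere.
UNWANTED_ATTRS = ("class", "id", "name", "style")


def clean_html(html):
    """Return a BeautifulSoup tree with scripts and unwanted attributes removed."""
    # Assumption: the standard-library parser; lxml or html5lib would also work.
    soup = BeautifulSoup(html, "html.parser")

    # Remove <script> elements together with their contents.
    for script in soup("script"):
        script.decompose()

    # find_all(True) matches every tag in the document.
    for tag in soup.find_all(True):
        for attribute in UNWANTED_ATTRS:
            # pop() with a default does nothing if the attribute is absent.
            tag.attrs.pop(attribute, None)

    return soup


cleaned = clean_html(HTML)
print(cleaned)
```

After this runs, the <p> tag has no attributes left, the <img> tag keeps only its href, and the <script> element is gone. The exact serialization of the output (for example how the <img> tag is closed) depends on the parser.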