HTML unescape does not work with BeautifulSoup replace_with


I am trying to edit the inner HTML of some elements in Python using BeautifulSoup. Here is a simple example:

from bs4 import BeautifulSoup
import html

html_str = '<div><span><strong>Hello world</strong></span></div>'
soup = BeautifulSoup(html_str, 'html.parser')
span = soup.select_one('span')
span.replace_with('message: ' + html.unescape(span.decode_contents()) + ', end of message')

print(soup)

I was expecting to get a decoded string, like: <div>message: <strong>Hello world</strong>, end of message</div>

But instead I got: <div>message: &lt;strong&gt;Hello world&lt;/strong&gt;, end of message</div>

Note that this only happens when the target element contains a child element. If you run the same code on the strong element (with soup.select_one('strong')), it works as expected.


Solution

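  • A plain string passed to .replace_with is inserted as a text node, so any < and > in it are escaped when the soup is printed. The easiest fix is to pass a new BeautifulSoup object to .replace_with instead, e.g.: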

    from bs4 import BeautifulSoup
    
    html_str = "<div><span><strong>Hello world</strong></span></div>"
    soup = BeautifulSoup(html_str, "html.parser")
    
    span = soup.select_one("span")
    span.replace_with(BeautifulSoup(f"message: {str(span)}, end of message", "html.parser"))
    
    print(soup)
    

    Prints:

    <div>message: <span><strong>Hello world</strong></span>, end of message</div>
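
Note that the snippet above interpolates str(span), so the <span> wrapper stays in the result. If the goal is the output from the question (without the <span>), the following is a minimal sketch of two possible variants. The function names replace_with_parsed_contents and unwrap_span are made up for illustration; the calls used are the standard bs4 methods decode_contents, replace_with, insert_before, insert_after and unwrap.

```python
from bs4 import BeautifulSoup

HTML = "<div><span><strong>Hello world</strong></span></div>"


def replace_with_parsed_contents(html_str):
    """Reparse the span's inner HTML so the child tags survive as real tags."""
    soup = BeautifulSoup(html_str, "html.parser")
    span = soup.select_one("span")
    # decode_contents() returns the inner HTML only, without the <span> wrapper.
    fragment = BeautifulSoup(
        f"message: {span.decode_contents()}, end of message", "html.parser"
    )
    span.replace_with(fragment)
    return str(soup)


def unwrap_span(html_str):
    """Add the surrounding text as siblings, then drop the <span> wrapper."""
    soup = BeautifulSoup(html_str, "html.parser")
    span = soup.select_one("span")
    span.insert_before("message: ")
    span.insert_after(", end of message")
    # unwrap() replaces the tag with its own children, so nothing is reparsed.
    span.unwrap()
    return str(soup)


print(replace_with_parsed_contents(HTML))
print(unwrap_span(HTML))
```

Both calls should print <div>message: <strong>Hello world</strong>, end of message</div>. The second variant avoids a round trip through a string, which may be preferable when the children are large or have already been modified in the tree.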