Tags: python, html, python-3.x, beautifulsoup, html-parsing

How to make BeautifulSoup "understand" the &plus; HTML entity


Let's say we have an HTML file like this:

test.html

<div>
<i>Some text here.</i>
Some text here also.<br>
2 &plus; 4 = 6<br>
2 &lt; 4 = True
</div>

If I pass this HTML to BeautifulSoup, it escapes the & sign of the &plus; entity (and drops the semicolon), so the output HTML looks something like this:

<div>
<i>Some text here.</i>
Some text here also.<br>
2 &amp;plus 4 = 6<br>
2 &lt; 4 = True
</div>

Example Python 3 code:

from bs4 import BeautifulSoup

with open('test.html', 'rb') as file:
    soup = BeautifulSoup(file, 'html.parser')

print(soup)

How can I avoid this behavior?


Solution

  • Read the description of the different parser libraries: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

    Switching from html.parser to the html5lib parser (a separate package that has to be installed) could solve your problem:

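    from bs4 import BeautifulSoup

    # Same markup as test.html, inlined as a string
    s = '''
    <div>
    <i>Some text here.</i>
    Some text here also.<br>
    2 &plus; 4 = 6<br>
    2 &lt; 4 = True
    </div>'''

    # Parse with html5lib instead of html.parser
    soup = BeautifulSoup(s, 'html5lib')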
    

    And you get (note that html5lib also wraps the fragment in <html>, <head> and <body>):

    >>> soup
    
    <html><head></head><body><div>
    <i>Some text here.</i>
    Some text here also.<br/>
    2 + 4 = 6<br/>
    2 &lt; 4 = True
    </div></body></html>
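
If switching parsers is not an option, one possible workaround is to expand such entities yourself before handing the markup to BeautifulSoup. The sketch below uses only the standard library and is not part of the original answer. The helper name expand_html5_only_entities is hypothetical. It rests on an assumption the answer does not confirm: that the parser in use only knows the older HTML 4 entity names (html.entities.name2codepoint), whereas &plus; is defined only in the HTML5 table (html.entities.html5).

```python
import re
from html.entities import html5, name2codepoint

# Matches named character references such as &plus; or &lt;
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

# Characters that are significant in markup and must stay escaped
_MARKUP_CHARS = frozenset("<>&\"'")


def expand_html5_only_entities(markup: str) -> str:
    """Replace entities known only to HTML5 with their literal characters.

    Hypothetical helper: HTML 4 entities (e.g. &lt;), unknown names, and
    anything that expands to a markup-significant character are left as-is.
    """
    def repl(match: re.Match) -> str:
        name = match.group(1)
        char = html5.get(name + ";")
        if name in name2codepoint or char is None or char in _MARKUP_CHARS:
            return match.group(0)
        return char

    return _ENTITY.sub(repl, markup)


print(expand_html5_only_entities("2 &plus; 4 = 6<br>\n2 &lt; 4 = True"))
# prints:
# 2 + 4 = 6<br>
# 2 &lt; 4 = True
```

The cleaned string can then be passed to BeautifulSoup with 'html.parser' as in the question. The guard on markup-significant characters is a deliberate choice: HTML5-only spellings such as &LT; would otherwise turn into a raw < and change the structure of the document.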