python python-3.x parsing beautifulsoup html-parsing

Missing special characters and tags while parsing HTML using BeautifulSoup

I am trying to parse an HTML document using BeautifulSoup with Python.

However, parsing stops at certain special characters, as in this example:

from bs4 import BeautifulSoup
doc = '''
<html>
    <body>
        <div>And I said «What the %&#@???»</div>
        <div>some other text</div>
    </body>
</html>'''
soup = BeautifulSoup(doc, 'html.parser')
print(soup)

This code should output the whole document. Instead, it prints only:

<html>
<body>
<div>And I said «What the %</div></body></html>

The rest of the document is apparently lost. Parsing seems to stop at the character combination '&#'.

How can I either set up BeautifulSoup or preprocess the document to avoid such problems while losing as little (potentially informative) text as possible?

I am using bs4 version 4.6.0 with Python 3.6.1 on Windows 10.

Update: soup.prettify() does not help, because the soup is already broken by the time it is called.

Solution

Use "html5lib" as the parser instead of "html.parser" when creating the BeautifulSoup object. html5lib is a separate package, so it has to be installed first (for example with pip install html5lib). For example:

from bs4 import BeautifulSoup
doc = '''
<html>
    <body>
        <div>And I said «What the %&#@???»</div>
        <div>some other text</div>
    </body>
</html>'''

soup = BeautifulSoup(doc, 'html5lib')  # different parser

Now, if you print soup, it shows the whole document, including the text after '&#':

>>> print(soup)
<html><head></head><body>
        <div>And I said «What the %&amp;#@???»</div>
        <div>some other text</div>

</body></html>

From the "Differences between parsers" section of the BeautifulSoup documentation:

Unlike html5lib, html.parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn’t even bother to add an <html> tag.
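
If switching parsers is not an option, the other route the question mentions, preprocessing the document, can be roughly sketched with the standard library alone. This assumes that the bare '&' is what trips html.parser, as the question describes. The helper name escape_stray_ampersands and its regular expression are my own illustration, not part of BeautifulSoup. The idea is to escape every '&' that does not start something shaped like a character reference (&name;, &#123; or &#x1F;), so no text is dropped:

```python
import re

# Matches an "&" that does NOT start something shaped like a character
# reference: &name;, &#123; (decimal) or &#x1F; (hexadecimal).
# The pattern is a hypothetical helper, not a BeautifulSoup feature.
_STRAY_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_stray_ampersands(markup):
    """Replace bare '&' characters with '&amp;' and leave references alone."""
    return _STRAY_AMP.sub("&amp;", markup)


fragment = "<div>And I said «What the %&#@???»</div>"
print(escape_stray_ampersands(fragment))
# <div>And I said «What the %&amp;#@???»</div>
```

The cleaned string can then be passed to BeautifulSoup(..., 'html.parser') as before. Treat this as a starting point only: the pattern also rewrites ampersands inside script or style content, and it escapes legacy references written without the trailing semicolon (such as &copy).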