I am trying to parse an HTML document using BeautifulSoup with Python, but it stops parsing at special characters, as in this example:
from bs4 import BeautifulSoup
doc = '''
<html>
<body>
<div>And I said «What the %&#@???»</div>
<div>some other text</div>
</body>
</html>'''
soup = BeautifulSoup(doc, 'html.parser')
print(soup)
This code should output the whole document. Instead, it prints only
<html>
<body>
<div>And I said «What the %</div></body></html>
The rest of the document is apparently lost. Parsing was stopped by the combination '&#'.
The question is: how can I either configure BeautifulSoup or preprocess the document to avoid such problems while losing as little text (which may be informative) as possible?
I use bs4 version 4.6.0 with Python 3.6.1 on Windows 10.
Update. The method soup.prettify() does not help, because the soup is already broken.
Use "html5lib" as the parser instead of "html.parser" when constructing the BeautifulSoup object (html5lib is a separate package and has to be installed, e.g. with pip install html5lib). For example:
from bs4 import BeautifulSoup
doc = '''
<html>
<body>
<div>And I said «What the %&#@???»</div>
<div>some other text</div>
</body>
</html>'''
soup = BeautifulSoup(doc, 'html5lib')  # 'html5lib' instead of 'html.parser'
Now, if you print soup, it displays the whole document, including the text after '&#':
>>> print(soup)
<html><head></head><body>
<div>And I said «What the %&#@???»</div>
<div>some other text</div>
</body></html>
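Since html5lib is an optional dependency, it may help to choose the parser name at runtime. The following is only a minimal sketch: pick_parser is a hypothetical helper (not part of bs4), and it assumes that the parser names BeautifulSoup accepts ('html5lib', 'lxml') match the names of the installed modules.

```python
from importlib.util import find_spec


def pick_parser(preferred=("html5lib", "lxml")):
    """Return the first parser in `preferred` whose module is installed.

    Hypothetical helper: falls back to the built-in 'html.parser'
    when none of the preferred parser modules can be found.
    """
    for name in preferred:
        # find_spec returns None when a top-level module is not installed
        if find_spec(name) is not None:
            return name
    return "html.parser"
```

The result can then be passed as the second argument, e.g. BeautifulSoup(doc, pick_parser()). Keep in mind that different parsers may build different trees from the same broken markup.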
From the Difference Between Parsers document:
Unlike html5lib, html.parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn’t even bother to add an <html> tag.
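If switching parsers is not an option, the preprocessing route from the question can be sketched as well. This is an assumption-laden example, not a feature of bs4: escape_stray_ampersands is a hypothetical helper that rewrites every '&' that does not start a semicolon-terminated character reference as '&amp;', so that no text is dropped.

```python
import re

# '&' NOT followed by a complete reference such as '&amp;', '&#171;' or '&#xAB;'
_STRAY_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_stray_ampersands(markup):
    """Replace ampersands that do not begin a character reference with '&amp;'.

    Hypothetical helper. Limitation: legacy references written without a
    trailing semicolon (for example '&copy') are escaped too.
    """
    return _STRAY_AMP.sub("&amp;", markup)


fixed = escape_stray_ampersands("<div>And I said «What the %&#@???»</div>")
# fixed == "<div>And I said «What the %&amp;#@???»</div>"
```

The cleaned string can then be given to BeautifulSoup(fixed, 'html.parser'); the parser should read '&amp;' back as a literal '&', so the original text is preserved.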