
Missing special characters and tags while parsing HTML using BeautifulSoup


I am trying to parse an HTML document using BeautifulSoup with Python.

However, parsing stops at certain special characters, as in this example:

from bs4 import BeautifulSoup
doc = '''
<html>
    <body>
        <div>And I said «What the %&#@???»</div>
        <div>some other text</div>
    </body>
</html>'''
soup = BeautifulSoup(doc,  'html.parser')
print(soup)

This code should output the whole document. Instead, it prints only:

<html>
<body>
<div>And I said «What the %</div></body></html>

The rest of the document is lost. Parsing apparently stops at the character combination '&#'.

How can I either set up BeautifulSoup or preprocess the document to avoid such problems while losing as little text (which may be informative) as possible?

I use bs4 of version 4.6.0 with Python 3.6.1 on Windows 10.

Update: calling soup.prettify() does not help, because the soup is already broken by the time it is called.


Solution

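  • Use "html5lib" as the parser instead of "html.parser" when constructing the BeautifulSoup object. html5lib is a separate package and has to be installed first (for example, with pip install html5lib). For example: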

    from bs4 import BeautifulSoup
    doc = '''
    <html>
        <body>
            <div>And I said «What the %&#@???»</div>
            <div>some other text</div>
        </body>
    </html>'''
    
    soup = BeautifulSoup(doc,  'html5lib')
    #          different parser  ^
    

    Now printing soup displays the whole document:

    >>> print(soup)
    <html><head></head><body>
            <div>And I said «What the %&amp;#@???»</div>
            <div>some other text</div>
    
    </body></html>
    

    From the "Differences between parsers" section of the Beautiful Soup documentation:

    Unlike html5lib, html.parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn’t even bother to add an <html> tag.
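
The question also asks about preprocessing the document. As a rough alternative sketch (not part of the answer above), stray ampersands could be escaped before the text is handed to any parser, which should keep a sequence like '&#@' from being read as the start of a character reference. The helper name escape_stray_ampersands and the regular expression are illustrative assumptions. The pattern only accepts references that end in a semicolon, so legacy forms such as '&nbsp' without a semicolon would be escaped as well.

```python
import re

# Matches an ampersand that does NOT start a well-formed character reference:
# named (&amp;), decimal (&#169;) or hexadecimal (&#xA9;), each ending in ';'.
_STRAY_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_stray_ampersands(html_text):
    """Replace stray '&' characters with '&amp;', leaving real references alone."""
    return _STRAY_AMP.sub("&amp;", html_text)


doc = (
    "<div>And I said «What the %&#@???»</div>"
    "<div>&copy; &#169; &#xA9; some other text</div>"
)
cleaned = escape_stray_ampersands(doc)
print(cleaned)
```

The cleaned string can then be passed to BeautifulSoup as usual. No text is dropped: the only change is that each stray '&' becomes '&amp;', which parsers decode back to '&' in the text nodes.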