We're using Beautiful Soup to parse many websites successfully, but a few are giving us problems. An example is this page:
We're feeding the exact source to Beautiful Soup, but it returns a stunted HTML string and raises no errors.
Code:
soup = BeautifulSoup(site_html)
print str(soup.html)
Result:
<html class="no-js" lang="en"> <!--<![endif]--> </html>
I'm trying to determine what's tripping it up, but nothing jumps out at me when I look at the HTML source. Does anyone have any insight?
Try a different parser. The page parses fine with the html5lib parser:
>>> soup = BeautifulSoup(r.content, 'html5lib')
>>> len(soup.find_all('li'))
97
Not all parsers treat broken HTML the same way, so the tree you get back can differ depending on which one Beautiful Soup uses.
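To see which parser copes with a given page, one option is to run the same markup through each of them and compare how many elements survive. The following is only a rough sketch: the helper name `count_tags_per_parser` and the `sample` string are made up for illustration, and it assumes bs4 is installed (lxml and html5lib are optional and are skipped if missing).

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_tags_per_parser(markup, tag="li",
                          parsers=("html.parser", "lxml", "html5lib")):
    """Parse the same markup with each parser and count matching tags.

    Hypothetical helper for illustration; a parser that is not
    installed is reported as None instead of raising.
    """
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser is missing.
            counts[name] = None
            continue
        counts[name] = len(soup.find_all(tag))
    return counts


if __name__ == "__main__":
    # Stand-in markup; pass the real page source (e.g. r.content) instead.
    sample = "<html><body><ul><li>one</li><li>two</li></ul></body></html>"
    for parser, count in count_tags_per_parser(sample).items():
        print(parser, count)
```

If one parser reports far fewer tags than the others for the problem page, that parser is probably the one giving up on the broken markup.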