python html python-3.x html-parsing lxml

Parsing HTML with Python with no regard for correct tag hierarchy

I would like to parse a document that is syntactically an HTML document (tags with attributes, etc.) but structurally doesn't follow the rules (e.g. there could be an <html> tag inside a <div> tag inside a <body> tag). I also do not want the additional strictness of XML. Unfortunately, lxml only offers document_fromstring(), which requires an <html> root element, and fragment_fromstring(), which does not allow any <html> or <body> tags in unusual places.

How do I parse a document with no "fixing" of incorrect structure?

Solution

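BeautifulSoup with Python's built-in 'html.parser' backend should do this fine. That backend does not wrap the input in <html>/<body> tags or rearrange tags to fit HTML nesting rules, so the tree keeps the nesting as written (it will still close any tags left open).

It would be a case of: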

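from bs4 import BeautifulSoup
import requests

# Placeholder URL (not from the question); point this at your document.
url = "https://example.com/"

# Fetch the markup, then parse it with the lenient built-in parser.
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')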

Then you'd search soup (for example with find(), find_all() or select()) for whatever you're looking for.
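
To check that the odd nesting survives, here is a minimal sketch that parses a string instead of fetching a URL. The markup, the class and lang attributes, and the variable names are made up for illustration; only BeautifulSoup and 'html.parser' come from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical input: an <html> tag inside a <div> inside a <body>,
# mirroring the kind of structure described in the question.
markup = '<body><div class="outer"><html lang="en"><p>hello</p></html></div></body>'

soup = BeautifulSoup(markup, "html.parser")

# find() returns the first matching tag wherever it sits in the tree.
inner_html = soup.find("html")

# Walk upwards to see where the parser placed the <html> tag.
parent_names = [parent.name for parent in inner_html.parents]

print(parent_names)              # ancestor tag names, innermost first
print(inner_html["lang"])        # attributes are read like dict keys
print(inner_html.p.get_text())   # text of the <p> nested inside <html>
```

If the ancestors start with div and then body, the parser left the <html> tag where it was written rather than hoisting it to the root. Other backends such as 'lxml' or 'html5lib' may restructure the same input, so the choice of 'html.parser' matters here.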