I would like to parse a document that is syntactically an HTML document (using tags with attributes etc.), but structurally doesn't follow the rules (e.g. there could be an <html> tag inside a <div> tag inside a <body> tag). I also do not want the additional strictness of XML. Unfortunately, lxml only offers document_fromstring(), which requires an html root element, as well as fragment_fromstring(), which in turn does not allow any html or body tags in unusual places.
How do I parse a document with no "fixing" of incorrect structure?
BeautifulSoup should handle this fine.
It would be a case of:
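from bs4 import BeautifulSoup
import requests

# url: address of the document to parse (define it before running)
r = requests.get(url)
# 'html.parser' does not wrap the input in <html>/<body> the way the lxml or html5lib backends do
soup = BeautifulSoup(r.text, 'html.parser')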
Then you'd search soup for whatever you're looking for, e.g. with soup.find() or soup.find_all().
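To check that the nesting really is left alone, here is a minimal sketch that parses an in-memory string instead of fetching a URL. The markup string and the variable names are made up for illustration; only BeautifulSoup with the 'html.parser' backend comes from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical input: an <html> tag inside a <div> inside a <body>.
markup = "<body><div><html><p class='note'>hi</p></html></div></body>"

soup = BeautifulSoup(markup, "html.parser")

# Walk up from the misplaced <html> tag to see where the parser put it.
html_tag = soup.find("html")
ancestors = [parent.name for parent in html_tag.parents]
print(ancestors)

# Content inside the misplaced tag is still reachable the usual way.
note = html_tag.find("p")
print(note.get_text(), note["class"])
```

If the first two ancestors come out as div and body, the tree mirrors the source rather than being rearranged. Swapping "html.parser" for "lxml" or "html5lib" in the same script should show how those backends restructure the input.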