python, html, python-3.x, html-parsing, lxml

Parsing HTML with Python with no regard for correct tag hierarchy


I would like to parse a document that is syntactically an HTML document (tags with attributes, etc.) but structurally does not follow the rules (e.g. there could be an <html> tag inside a <div> tag inside a <body> tag). I also do not want the additional strictness of XML. Unfortunately, lxml only offers document_fromstring(), which requires an html root element, and fragment_fromstring(), which does not allow html or body tags in unusual places.

How do I parse a document with no "fixing" of incorrect structure?


Solution

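  • BeautifulSoup should handle this when used with the html.parser backend, which does not add missing <html> or <body> wrapper elements the way lxml does.

    It would be a case of: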

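    import requests
    from bs4 import BeautifulSoup

    # url is the address of the document to parse
    r = requests.get(url)
    r.raise_for_status()  # stop early on an HTTP error
    # 'html.parser' is the parser from Python's standard library
    soup = BeautifulSoup(r.text, 'html.parser')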
    

    Then search soup for whatever you are looking for, e.g. with soup.find() or soup.find_all().
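
To check that the unusual nesting from the question survives parsing, here is a minimal sketch that skips the network request and parses a string directly. The sample markup and variable names are made up for illustration. It assumes the html.parser backend, as other backends may restructure the input.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input: an <html> tag inside a <div> inside a <body>,
# as described in the question.
markup = '<body><div><html lang="en"><p>hello</p></html></div></body>'

# Parse with the standard-library backend.
soup = BeautifulSoup(markup, "html.parser")

# Locate the misplaced <html> tag and walk upwards to see where it sits in the tree.
inner_html = soup.find("html")
ancestors = [tag.name for tag in inner_html.parents]

print(ancestors)               # names of the enclosing tags, innermost first
print(inner_html.get("lang"))  # attributes are available as usual
```

If the nesting is preserved, the first two ancestors of the inner tag are div and body, in that order, rather than the tag having been hoisted to the root.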