Tags: python, beautifulsoup, lxml, html-parser

Python BeautifulSoup: lxml or html.parser?


I have to use BeautifulSoup, but I don't know which parser to choose. I'm hesitating between lxml and html.parser, or possibly using both. How can I tell whether a web page is compatible with lxml? How can I tell whether it is compatible with html.parser? Many thanks.


Solution

  • There is no silver bullet. Different HTML parsers behave differently, and you should pick the one that works for your particular page. "Works" here basically means that you can get to the data you want.

    The lxml parser is generally faster, and html5lib is the most lenient one. This kind of difference matters if you have broken or non-well-formed HTML to parse. html.parser is built in, so it helps you avoid extra dependencies if that is a concern. The Beautiful Soup documentation has a table that highlights the differences.
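
One way to check which parser "works" for a page is simply to try each one and see whether the data you want comes out. The following is a minimal sketch, not a definitive recipe: the helper name `extract_with_each_parser` and the sample markup `BROKEN_HTML` are made up for illustration, and it assumes `bs4` is installed. lxml and html5lib are optional here; if one is missing, BeautifulSoup raises `FeatureNotFound` and the sketch skips it.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample: deliberately broken markup, <p> and <b> are never closed.
BROKEN_HTML = "<div><p>Price: <b>42</div>"


def extract_with_each_parser(html, tag_name,
                             parsers=("lxml", "html.parser", "html5lib")):
    """Return {parser_name: text of the first <tag_name>, or None if absent}.

    Parsers that are not installed are left out of the result.
    """
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # The parser library is not installed; skip it.
            continue
        node = soup.find(tag_name)
        results[name] = node.get_text(strip=True) if node is not None else None
    return results


results = extract_with_each_parser(BROKEN_HTML, "b")
for name, text in results.items():
    print(f"{name}: {text!r}")
```

If every installed parser returns the value you expect, any of them will do and you can choose based on speed or dependencies. If the results differ, keep the parser that yields the correct data for that page.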