Tags: python, html-parsing, beautifulsoup, lxml, pyquery

What’s the most forgiving HTML parser in Python?


I have some random HTML that I tried to parse with BeautifulSoup, but in most cases (>70%) it chokes. I tried BeautifulSoup 3.0.8 and 3.2.0 (there were some problems with 3.1.0 upwards), but the results are almost the same.

I can recall several HTML parser options available in Python off the top of my head:

  • BeautifulSoup
  • lxml
  • pyquery

I intend to test all of these, but I wanted to know which one has proved the most forgiving in your tests and will at least attempt to parse bad HTML.
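For anyone wanting to run that kind of comparison, here is a rough sketch of a test harness. It assumes the newer `bs4` package (BeautifulSoup 4) rather than the 3.x releases mentioned above, because it lets you switch the underlying parser by name. The `SAMPLES` list and the `compare_parsers` helper are made up for illustration, and `lxml` and `html5lib` are optional installs that are skipped if missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical malformed snippets; swap in your own documents.
SAMPLES = [
    "<p>unclosed paragraph<p>another",
    "<table><tr><td>cell<div>stray</table>",
    "<b><i>misnested</b></i>",
]

# Parser names understood by BeautifulSoup 4.
PARSERS = ["html.parser", "lxml", "html5lib"]


def compare_parsers(samples, parsers=PARSERS):
    """Map each parser name to the number of samples it parsed without
    raising, or to None when that parser is not installed."""
    results = {}
    for name in parsers:
        try:
            BeautifulSoup("", name)  # probe: is this parser available?
        except FeatureNotFound:
            results[name] = None
            continue
        parsed = 0
        for html in samples:
            try:
                BeautifulSoup(html, name)
                parsed += 1
            except Exception:
                pass  # count as a failure and move on
        results[name] = parsed
    return results


if __name__ == "__main__":
    for name, parsed in compare_parsers(SAMPLES).items():
        status = "not installed" if parsed is None else f"{parsed}/{len(SAMPLES)} parsed"
        print(f"{name}: {status}")
```

Counting exceptions only measures whether a parser chokes, not whether the tree it builds is the one you wanted, so it is worth inspecting the output for a few documents by hand as well.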


Solution

  • I ended up using BeautifulSoup 4.0 with html5lib for parsing, which is much more forgiving. With some modifications to my code it now works considerably well. Thanks, all, for the suggestions.
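A minimal sketch of what that setup might look like, assuming both packages are installed (`pip install beautifulsoup4 html5lib`). The `parse_forgiving` helper and the sample markup are illustrative, not the poster's actual code, and the fallback to the standard-library parser is an added assumption so the snippet still runs when html5lib is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_forgiving(html):
    """Parse with html5lib, which repairs markup the way a browser does.
    Falls back to the built-in parser if html5lib is not installed
    (an assumption added for this sketch)."""
    try:
        return BeautifulSoup(html, "html5lib")
    except FeatureNotFound:
        return BeautifulSoup(html, "html.parser")


# Hypothetical broken input: unclosed <li>, <p>, and <b> tags.
broken = "<ul><li>first<li>second</ul><p>dangling <b>bold"

soup = parse_forgiving(broken)
items = soup.find_all("li")
bold = soup.find("b")

print(len(items), "list items found")
print("bold text:", bold.get_text())
```

html5lib is noticeably slower than lxml or the built-in parser, so it is a reasonable choice when tolerance for bad markup matters more than speed.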