html, screen-scraping, html-parsing

What language/tool should I use for HTML parsing?


I have a couple of websites that I want to extract data from and, based on previous experience, this isn't as easy as it sounds. Why? Simply because the HTML pages I have to parse aren't properly formatted (missing closing tags, etc.).

Considering that I have no constraints regarding the technology, language or tool that I can use, what are your suggestions for easily parsing and extracting data from HTML pages? I have tried HTML Agility Pack and BeautifulSoup, and even these tools aren't perfect (HTML Agility Pack is buggy, and BeautifulSoup's parsing engine doesn't work with the pages I am passing to it).


Solution

  • You can use pretty much any language you like; just don't try to parse HTML with regular expressions.

    So let me rephrase that: you can use any language you like that has an HTML parser, which is pretty much everything invented in the last 15-20 years.

    If you're having issues with particular pages, I suggest you look into repairing them with HTML Tidy before parsing.
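To illustrate the "use a real parser, not regular expressions" advice, here is a minimal sketch in Python using only the standard library's `html.parser` module, which does not require end tags to match start tags. The class name `LinkExtractor`, the helper `extract_links`, and the sample markup in `BROKEN_HTML` are all made up for this example; the input is meant to resemble the kind of page described in the question (unclosed tags, an unquoted attribute). This is one possible approach under those assumptions, not a substitute for repairing the page with a tool such as HTML Tidy.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs from <a> tags, tolerating broken markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # A new <a> while one is still open means the previous
            # one was never closed, so finish it first.
            self._flush()
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_endtag(self, tag):
        if tag == "a":
            self._flush()

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def close(self):
        super().close()
        # Finish a link that was still open at end of input.
        self._flush()

    def _flush(self):
        if self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
        self._href = None
        self._text = []


def extract_links(html):
    """Hypothetical helper: return all (href, text) pairs found in html."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical malformed input: unclosed <p>, <li> and <a> tags,
# an unquoted attribute value, and no closing </html>.
BROKEN_HTML = """
<html><body>
<p>Products
<ul>
  <li><a href="/widgets">Widgets
  <li><a href=/gadgets>Gadgets</a>
</ul>
<p>See also <a href="/about">About us
</body>
"""

if __name__ == "__main__":
    for href, text in extract_links(BROKEN_HTML):
        print(href, "->", text)
```

Note that `html.parser` only emits events as it reads the markup; it does not build or repair a document tree. If you need a cleaned-up tree to query, running the page through HTML Tidy first, as suggested above, is the more robust route.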