common-lisp, web-scraping, quicklisp

Common Lisp package for parsing invalid HTML?


As a learning exercise, I'm writing a web scraper in Common Lisp. The (rough) plan is:

  1. Use Quicklisp to manage dependencies
  2. Use Drakma to load the pages
  3. Parse the pages with xmls

I've just run into a sticking point: the website I'm scraping doesn't always produce valid XHTML. This means that step 3 (parse the pages with xmls) doesn't work. And I'm as loath to use regular expressions as this guy :-)

So, can anyone recommend a Common Lisp package for parsing invalid XHTML? I'm imagining something similar to the HTML Agility Pack for .NET ...


Solution

  • The "closure-html" project (available in Quicklisp) recovers from malformed HTML and produces a parse tree you can work with. I use closure-html together with CXML to process arbitrary web pages, and it works nicely. http://common-lisp.net/project/closure/closure-html/
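
As a rough illustration of how this can fit the plan in the question, here is a minimal sketch that fetches a page with Drakma and parses it with closure-html. It assumes Quicklisp is already set up, and it uses the LHTML builder (plain nested lists) rather than a CXML DOM to keep the example short; passing (cxml-dom:make-dom-builder) as the handler instead should give a DOM you can process with CXML. The function names parse-html-leniently, fetch-and-parse, and collect-links are made up for this example and are not part of either library.

```lisp
;; Assumption: Quicklisp is installed and loaded in the running image.
(ql:quickload '(:drakma :closure-html))

(defun parse-html-leniently (html)
  "Parse HTML (a string or octet vector), even if malformed, into an LHTML tree.
LHTML nodes look like (name attributes child...), with text as plain strings."
  (chtml:parse html (chtml:make-lhtml-builder)))

(defun fetch-and-parse (url)
  "Download URL with Drakma and hand the response body to closure-html."
  (parse-html-leniently (drakma:http-request url)))

(defun collect-links (node)
  "Return the href values of all <a> elements found in the LHTML tree NODE."
  (when (consp node)                      ; text nodes are strings: skip them
    (destructuring-bind (name attributes &rest children) node
      (let ((href (and (eq name :a)
                       (second (assoc :href attributes)))))
        (append (when href (list href))
                (mapcan #'collect-links children))))))
```

Something like (collect-links (fetch-and-parse "http://example.com/")) would then return the link targets on that page, and (parse-html-leniently "<p>unclosed <a href='/x'>link") shows how unclosed tags are repaired rather than rejected.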