html ruby parsing rubygems

Parse HTML using Ruby core libraries? (i.e., no gems required)


Some friends and I have been working on a set of scripts that make it easier to do work on the machines at uni. One of these tools currently uses Nokogiri, but we want the tools to run on all machines with as little setup as possible, so we've been trying to find a 'native' HTML parser instead of requiring users to install RVM and custom gems (most users have limited disk space).

Are we pretty much restricted to Nokogiri/Hpricot? Or should we look at just writing our own custom parser that fits our needs?

Cheers.

EDIT: If there are posts on here that I've missed in my searches, let me know! S.O. is sometimes just too large to find things effectively...


Solution

  • There is no HTML parser in the Ruby stdlib.
    HTML parsers have to be more forgiving of bad markup than XML parsers.

    You could run the HTML through tidy (http://tidy.sourceforge.net)
    to clean it up and produce valid, well-formed markup.
    That output can then be read with REXML :-) which is in the stdlib.

    REXML was much slower than Nokogiri when I last checked in 2009,
    though Sam Ruby had been working on making REXML faster.

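    A better way would be to improve your deployment.
    Take a look at http://gembundler.com/bundle_package.html (bundle package caches the required .gem files in vendor/cache, so they can be installed without fetching them from rubygems.org) and at using capistrano (or some such) to provision servers.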
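
    As a rough illustration of the tidy-then-REXML route, here is a minimal sketch. It assumes the tidy binary is installed and on the PATH; the helper names tidy_to_xhtml and extract_links and the sample markup are made up for this example and are not part of any library. The sample string stands in for tidy's output so the parsing half runs on its own.

```ruby
require 'rexml/document'
require 'open3'

# Hypothetical helper: pipe messy HTML through the external `tidy` binary
# (assumed to be installed and on PATH) and return well-formed XHTML.
# -asxml asks for XML output, -numeric turns named entities such as &nbsp;
# into numeric ones that an XML parser accepts, -quiet drops the summary.
def tidy_to_xhtml(html)
  out, _err, _status = Open3.capture3('tidy', '-quiet', '-asxml', '-numeric',
                                      stdin_data: html)
  out
end

# Hypothetical helper: collect [link text, href] pairs from well-formed XHTML.
# Walking the tree and comparing element names sidesteps XPath and the
# XHTML default namespace.
def extract_links(xhtml)
  doc = REXML::Document.new(xhtml)
  links = []
  doc.each_recursive do |el|
    href = el.attributes['href']
    next unless el.name == 'a' && href
    links << [el.text.to_s.strip, href]
  end
  links
end

# Stand-in for what tidy might emit; real pages would go through
# tidy_to_xhtml first.
SAMPLE_XHTML = <<~XHTML
  <html xmlns="http://www.w3.org/1999/xhtml">
    <head><title>Timetable</title></head>
    <body>
      <p><a href="/labs">Labs</a> and <a href="/printing">Printing</a></p>
    </body>
  </html>
XHTML

extract_links(SAMPLE_XHTML).each { |text, href| puts "#{text} -> #{href}" }
```

    For a real page you would call extract_links(tidy_to_xhtml(raw_html)). REXML raises REXML::ParseException on markup that is not well-formed, which is why the tidy step comes first.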