
How do I convert a Nokogiri statement into Mechanize for screen scraping?


I'm trying to use Mechanize to scrape some tags from a page. I've used Nokogiri successfully to scrape them before, but now I'm trying to fold that code into a wider Mechanize class. Here is the Nokogiri statement:

page = Nokogiri::HTML(open(@model.url, "User-Agent" => request.env['HTTP_USER_AGENT']))
@model.icons = page.css("link[rel='apple-touch-icon']").to_s

And here is what I thought would be the Mechanize equivalent but it's not working:

agent = Mechanize.new
page = agent.get(@model.url, "User-Agent" => request.env['HTTP_USER_AGENT'])
@model.icons = page.search("link[rel='apple-touch-icon']").to_s

The first one returns a link tag as expected <link rel="apple-touch-icon" etc etc..></link>. The second statement returns a blank string. If I take the to_s off the end I get a super long output. I assume it's an error or the actual Mechanize object or something.

Link to long output when not converting to string: https://gist.github.com/eadam/5583541


Solution

  • Without sample HTML it's difficult to recreate the problem, so this is some general information that might help you.

    That "long output" is the inspect output of the Nokogiri::NodeSet you got back from the search method. If search returns multiple nodes, or the nodes have lots of children, the inspect output can run long, but that's what it should do.

    css and search are very similar in that both return a NodeSet. css assumes that the string passed in is a CSS selector, while search is more generic and tries to figure out whether it was given a CSS or an XPath expression. If it guesses wrong, the pattern is unlikely to match. You can use at or search to be generic and let Nokogiri figure it out, or be explicit: at_css and at_xpath replace at, and css and xpath replace search. The at variants all return the first matching Node, similar to using search('some_path').first.
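
    As a rough illustration of those methods side by side, here is a minimal sketch run against made-up HTML. The markup, the href values and the variable names are assumptions for the example, not the page from the question:

    ```ruby
    require 'nokogiri'

    # Hypothetical HTML standing in for the page in the question.
    html = <<~HTML
      <html><head>
        <link rel="apple-touch-icon" href="/icon-57.png">
        <link rel="apple-touch-icon" sizes="72x72" href="/icon-72.png">
        <link rel="stylesheet" href="/site.css">
      </head><body></body></html>
    HTML

    doc = Nokogiri::HTML(html)
    selector = "link[rel='apple-touch-icon']"

    by_css    = doc.css(selector)                             # NodeSet, string treated as CSS
    by_search = doc.search(selector)                          # NodeSet, Nokogiri picks CSS or XPath
    by_xpath  = doc.xpath("//link[@rel='apple-touch-icon']")  # NodeSet, string treated as XPath
    first     = doc.at(selector)                              # first matching node, or nil

    puts by_css.size     # number of nodes matched via css
    puts by_search.size  # number of nodes matched via search
    puts by_xpath.size   # number of nodes matched via xpath
    puts first['href']   # href attribute of the first match
    ```

    With this sample, css, search and xpath should all find the same two link tags, and at should return the first of them.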

    to_s turns the NodeSet back into a representation of the source that was passed in. I prefer to be more explicit, using either to_xml, to_xhtml or to_html.
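
    To make the difference between inspect and the serializers concrete, here is a small sketch, again using invented one-line HTML rather than the real page. It also shows that a NodeSet with no matches turns into an empty string, so checking empty? can help tell "nothing matched" apart from a serialization problem:

    ```ruby
    require 'nokogiri'

    # Hypothetical one-line document; the real page from the question is unknown.
    doc = Nokogiri::HTML('<html><head><link rel="apple-touch-icon" href="/icon.png"></head></html>')

    found   = doc.search("link[rel='apple-touch-icon']")
    missing = doc.search("link[rel='no-such-rel']")  # selector chosen to match nothing

    puts found.inspect   # debugging view of the NodeSet, not markup
    puts found.to_html   # nodes serialized as HTML
    puts found.to_xml    # nodes serialized as XML
    puts found.to_s      # serialization picked from the document type

    puts missing.empty?        # true: the NodeSet holds no nodes
    puts missing.to_s.inspect  # "": an empty NodeSet serializes to an empty string
    ```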

    Why don't you get output for search like you do for css? I don't know, because I can't test against the HTML you're parsing. Answering questions, like data processing, is a GIGO (garbage in, garbage out) situation.