ruby-on-rails ruby nokogiri mechanize mechanize-ruby

How do I convert a Nokogiri statement into Mechanize for screen scraping?

I'm trying to use Mechanize to scrape some tags from a page. I've used Nokogiri successfully to scrape them before, but now I'm trying to combine them into a wider Mechanize class. Here is the Nokogiri statement:

page = Nokogiri::HTML(open(@model.url, "User-Agent" => request.env['HTTP_USER_AGENT']))
@model.icons = page.css("link[rel='apple-touch-icon']").to_s

And here is what I thought would be the Mechanize equivalent but it's not working:

agent = Mechanize.new
page = agent.get(@model.url, "User-Agent" => request.env['HTTP_USER_AGENT'])
@model.icons = page.search("link[rel='apple-touch-icon']").to_s

The first one returns a link tag as expected <link rel="apple-touch-icon" etc etc..></link>. The second statement returns a blank string. If I take the to_s off the end I get a super long output. I assume it's an error or the actual Mechanize object or something.

Link to long output when not converting to string: https://gist.github.com/eadam/5583541

Solution

Without sample HTML it's difficult to recreate the problem, so this is some general information that might help you.

That "long output" is the inspect output of the Nokogiri::XML::NodeSet you got back from the search method. If search returns multiple nodes, or the nodes have lots of children, the inspect output can run long, but that is what it should do.

css and search are very similar, in that both return a NodeSet. css assumes that the string passed in is a CSS selector, while search is more generic and attempts to figure out whether it was given a CSS or an XPath expression. If it guesses wrong, the pattern is unlikely to find a match. You can use at or search to be generic and let Nokogiri figure it out, or be explicit: at_css and at_xpath replace at, and css and xpath replace search. The at variants all return the first matching Node, similar to using search('some_path').first.

to_s turns the NodeSet back into a representation of the source that was passed in. I prefer to be more explicit, using either to_xml, to_xhtml or to_html.

Why don't you get output from search like you do from css? I don't know, because I can't test against the HTML you're parsing. Answering questions, like data processing, is a GIGO (garbage in, garbage out) situation.