Tags: ruby, css-selectors, nokogiri, mechanize

Mechanize search unable to find CSS selector (it's definitely present)


I have a long CSS selector that works perfectly well in CSS, jQuery, etc. But the very same selector does not work on a Mechanize::Page object: the search simply returns an empty array.

The selector targets a paragraph and, in another case, an h1. I also converted the page to a string with page.body, and the element is definitely there, but neither the search nor the at method returns anything.

What could be the cause of this?

My code looks like this:

agent = Mechanize.new
page  = agent.get 'http://example.com'

page.search(source.read_more_selector).each do |read_more|
  inner_page = agent.get(read_more['href'])
  # printing inner_page.body shows valid HTML for each of these pages, but...

  inner_page.search(source.inner_title_selector).each do |inner_content|
    # ...nothing is found here, even though the selector should definitely match something
  end
end

The CSS selector that works everywhere else (source.inner_title_selector):

div#main-container-body > div#body-container > table > tbody > tr > td > span#ajaxprochoice > table > tbody > tr > td > table > tbody > tr > td > table > tbody > tr > td > div > h1.h1productHead

Output of inner_page.body for one of the many loop iterations (too long to include here):

http://pastebin.com/MtXDVADR

So inner_page.search with the above selector should definitely match the h1 inside that HTML (run against the Mechanize::Page object, of course, not the string), but it doesn't.

I went to the actual page online, opened the browser console, and ran this simple jQuery command to try the selector out:

$('div#main-container-body > div#body-container > table > tbody > tr > td > span#ajaxprochoice > table > tbody > tr > td > table > tbody > tr > td > table > tbody > tr > td > div > h1.h1productHead').hide();

And it worked, which pretty much means the selector itself is valid.

Edit

When I added this piece of code:

inner_page.at('.h1productHead').to_s

This returned a result. But the full selector returns nothing. Why is Mechanize so inflexible with selectors in this case?


Solution

  • The page you are searching doesn’t contain any tbody tags. When your browser parses the page, it adds the missing tbody elements to the DOM that it creates. This means that when you examine the page through the browser’s inspector and console, it behaves as though the tbody tags exist.

    Nokogiri doesn’t add this tag when parsing. When you use Nokogiri to search for your query (which contains tbody) it looks for an explicit tbody tag, and so returns no matches when it fails to find one.

    The simplest fix is to remove every tbody step from your query, along with the > combinator that goes with it, so that table > tbody > tr becomes table > tr.

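    You could also look into Nokogumbo, which extends Nokogiri with Google’s Gumbo HTML5 parser, and which does add the tbody elements to the parsed document. (Nokogumbo has since been merged into Nokogiri itself, where it is available as Nokogiri::HTML5 from version 1.12 onward.)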
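
    As a rough illustration of the "remove the tbodys" fix, here is a minimal sketch in plain Ruby. The strip_tbody helper is hypothetical (it is not part of Mechanize or Nokogiri), and it assumes the selector joins its steps with the > child combinator, as the one in the question does:

```ruby
# Hypothetical helper, not part of Mechanize or Nokogiri: removes every
# `tbody` step from a CSS selector so that it can match markup that has
# no explicit <tbody> tags.
# Assumption: steps are joined with the `>` child combinator.
def strip_tbody(selector)
  selector
    .gsub(/\s*>\s*tbody(?![\w-])/, '')     # "table > tbody > tr" becomes "table > tr"
    .gsub(/(?<![\w#.-])tbody\s*>\s*/, '')  # a leading "tbody > tr" becomes "tr"
    .strip
end

# The selector from the question, as it works in the browser.
BROWSER_SELECTOR =
  'div#main-container-body > div#body-container > table > tbody > tr > td > ' \
  'span#ajaxprochoice > table > tbody > tr > td > table > tbody > tr > td > ' \
  'table > tbody > tr > td > div > h1.h1productHead'

# The same selector with the tbody steps removed.
PARSER_SELECTOR = strip_tbody(BROWSER_SELECTOR)
puts PARSER_SELECTOR
```

    The resulting string could then be passed to inner_page.search in place of the original selector. If only some tables on a page carry an explicit tbody, a shorter selector such as the .h1productHead one from the question's edit is likely the more robust choice.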