Tags: ruby, mechanize, mechanize-ruby

Skip large pages in Mechanize (Ruby)


I'm trying to skip processing a few large pages (some over 10MB) scattered in a result set, as Mechanize (version 2.7.3) crawls an array of links.

Unfortunately I can't find a 'content-length' property or a similar indicator. The Mechanize::FileResponse class has a content_length method but Mechanize::Page does not.

Current approach

At the moment I'm calling content.length on the page. This is very slow when one of the large pages is crawled, since the whole body has already been downloaded before the length check runs:

detail_links.each do |detail_link|
  detail_page = detail_link.click

  # skip long pages (the body is already downloaded at this point)
  next if detail_page.content.length > 100_000

  # rest of the processing
end

content_length during response_read

In the Mechanize source code I found a reference to content_length when the response is read. Is querying the response properties a possible solution?

# agent.rb extract from the Mechanize project
def response_read response, request, uri
  content_length = response.content_length

  if use_tempfile? content_length then
    body_io = make_tempfile 'mechanize-raw'
  else
    body_io = StringIO.new.set_encoding(Encoding::BINARY)
  end
  # ...

Solution

  • Mechanize will normally "get" the entire page. Instead you should issue a HEAD request first to learn the page size, then conditionally GET the page, as sketched below. See "How can I perform a Head request using mechanize in Ruby" for an example.

    The thing to be careful of is that a dynamically generated resource might not have a known size when you do the HEAD request, so you could get a response without a Content-Length entry. Notice that in the selected answer for the question linked above, Google didn't return the Content-Length header because the page is dynamically generated. Static pages and resources should have the header... unless the server doesn't return it for some reason.
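
    Here is a minimal sketch of that flow, using the 100_000-byte threshold from the question. The listing URL and the /detail/ link filter are hypothetical placeholders; agent.head and head.header['content-length'] are standard Mechanize calls:

    require 'mechanize'

    agent = Mechanize.new
    page  = agent.get 'http://example.com/results'   # hypothetical listing page

    detail_links = page.links_with(href: /detail/)   # hypothetical selector

    detail_links.each do |detail_link|
      # Resolve the (possibly relative) href against the listing page's URI.
      detail_uri = page.uri.merge(detail_link.uri)

      # HEAD request: headers only, no body is downloaded.
      head = agent.head(detail_uri)
      size = head.header['content-length']

      # Skip when the server reported a size and it is over the threshold.
      # Dynamically generated pages may omit the header; those fall through
      # and are fetched normally.
      next if size && size.to_i > 100_000

      detail_page = detail_link.click

      # rest of the processing
    end

    The trade-off is one extra round trip per link, and the check only helps when the server actually sends Content-Length.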

    The Mechanize documentation mentions this:

    Problems with content-length

    Some sites return an incorrect content-length value. Unlike a browser, mechanize raises an error when the content-length header does not match the response length since it does not know if there was a connection problem or if the mismatch is a server bug.

    The error raised, Mechanize::ResponseReadError, can be converted to a parsed Page, File, etc. depending upon the content-type:

    agent = Mechanize.new
    uri = URI 'http://example/invalid_content_length'
    
    begin
      page = agent.get uri
    rescue Mechanize::ResponseReadError => e
      page = e.force_parse
    end
    

    In other words, while a HEAD request can help, it's not necessarily going to give you enough information to skip huge pages reliably. You have to investigate the site you're crawling and learn how its server responds.
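
    Putting the two together (same hypothetical listing page, links, and threshold as the sketch above), a defensive loop might look like this:

    detail_links.each do |detail_link|
      detail_uri = page.uri.merge(detail_link.uri)

      # Use the HEAD result when the server supplies Content-Length...
      size = agent.head(detail_uri).header['content-length']
      next if size && size.to_i > 100_000

      # ...and tolerate a bad Content-Length on the GET, as the
      # documentation quoted above suggests.
      begin
        detail_page = detail_link.click
      rescue Mechanize::ResponseReadError => e
        detail_page = e.force_parse
      end

      # rest of the processing
    end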