I'm writing an application in Ruby that can search and fetch data from a site that has more than 10,000 pages. I use OpenURI and Nokogiri to open and parse web pages, pull the data I need out of them, and save it to a local data file:
# An example
require 'open-uri'
require 'nokogiri'

page = Nokogiri::HTML(open("http://example.com/books/title001.html"))
# Get the title, author, synopsis, etc. from that page
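For reference, the rest of the per-page work looks roughly like this; the selectors and the output file name are just placeholders, not the site's real markup:

# Continuing from the snippet above; these selectors are placeholders.
title    = page.at_css('h1.title')
author   = page.at_css('.author')
synopsis = page.at_css('.synopsis')

File.open('books.txt', 'a') do |f|
  f.puts [title, author, synopsis].map { |node| node && node.text }.join("\t")
end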
I'm on an ADSL connection, and opening a page takes about 1 second on average. Since the site has about 10,000 pages, it would take roughly 3 hours to open every page and fetch the data for all the books, which is unacceptable for this application because my users won't want to wait that long.
How can I open and parse a large number of web pages quickly and efficiently with OpenURI and Nokogiri?
If I can't do that with them, what should I do? And how do some applications that do the same work (list the books, fetch all the data from their pages, and save it to a file), such as some manga downloaders, manage it in only 5-10 minutes for large manga sites (about 10,000 titles)?
Don't use OpenURI for this; there's a much better way if you use Hydra and Typhoeus.
Like a modern code version of the mythical beast with 100 serpent heads, Typhoeus runs HTTP requests in parallel while cleanly encapsulating handling logic.
...
Parallel requests:
hydra = Typhoeus::Hydra.new
10.times.map{ hydra.queue(Typhoeus::Request.new("www.example.com", followlocation: true)) }
hydra.run
Further down in the documentation...
How to get an array of responses back after executing a queue:
hydra = Typhoeus::Hydra.new
requests = 10.times.map {
  request = Typhoeus::Request.new("www.example.com", followlocation: true)
  hydra.queue(request)
  request
}
hydra.run
responses = requests.map { |request|
  request.response.response_body
}
request.response.response_body
is the line you want to wrap with Nokogiri's parser:
Nokogiri::HTML(request.response.response_body)
At that point you'll have an array of DOMs to walk through and process.
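Adapted to your book pages, a minimal sketch might look like this; the URL pattern, the max_concurrency value, and the CSS selector are assumptions for illustration, not details of the real site:

require 'typhoeus'
require 'nokogiri'

# Hypothetical URL pattern; substitute the site's real page list.
urls = (1..10_000).map { |n| format("http://example.com/books/title%03d.html", n) }

hydra = Typhoeus::Hydra.new(max_concurrency: 20)

requests = urls.map do |url|
  request = Typhoeus::Request.new(url, followlocation: true)
  hydra.queue(request)
  request
end

hydra.run   # blocks until every queued request has finished

books = requests.map do |request|
  doc   = Nokogiri::HTML(request.response.response_body)
  title = doc.at_css('h1.title')   # placeholder selector
  title && title.text
end

Keep max_concurrency modest so you don't hammer the site, and note that this holds every response in memory before parsing, which is one reason to process them as they arrive instead.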
But wait! There's more!
Because you want to shave some processing time, you'll want to set up a Thread and Queue: push the parsed DOMs (or just the unparsed HTML response_body) onto the queue, then have the thread do the parsing and file writing. There's a rough sketch of that pipeline after the Queue docs below.
It's not hard, but it starts to put the question out of scope for Stack Overflow as it becomes a small book. Read the Thread and Queue documentation, especially the section about producers and consumers, and you should be able to piece it together. This is from the ri Queue docs:
= Queue < Object
(from ruby core)
------------------------------------------------------------------------------
This class provides a way to synchronize communication between threads.
Example:
require 'thread'

queue = Queue.new

producer = Thread.new do
  5.times do |i|
    sleep rand(i) # simulate expense
    queue << i
    puts "#{i} produced"
  end
end

consumer = Thread.new do
  5.times do |i|
    value = queue.pop
    sleep rand(i/2) # simulate expense
    puts "consumed #{value}"
  end
end
------------------------------------------------------------------------------
= Class methods:
new
= Instance methods:
<<, clear, deq, empty?, enq, length, num_waiting, pop, push, shift, size
I've used it to process large numbers of URLs in parallel and it was easy to set up and use. It's possible to do this using Threads for everything, and not use Typhoeus, but I think it's wiser to piggyback on an existing, well-written tool than to try to roll your own.
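To show how the pieces could fit together, here's a rough producer/consumer sketch in that spirit: Hydra's on_complete callbacks push each raw response body onto a Queue while a separate thread parses with Nokogiri and writes the file. The URL pattern, selector, file name, and max_concurrency value are all invented for the example.

require 'typhoeus'
require 'nokogiri'
require 'thread'

# Hypothetical URL pattern; substitute the site's real page list.
urls = (1..10_000).map { |n| format("http://example.com/books/title%03d.html", n) }

queue = Queue.new

# Consumer: parse each HTML body as it arrives and append one line per book.
consumer = Thread.new do
  File.open('books.txt', 'w') do |f|
    while (html = queue.pop)
      doc   = Nokogiri::HTML(html)
      title = doc.at_css('h1.title')   # placeholder selector
      f.puts title.text if title
    end
  end
end

# Producer: Hydra fetches the pages in parallel and pushes the raw bodies.
hydra = Typhoeus::Hydra.new(max_concurrency: 20)
urls.each do |url|
  request = Typhoeus::Request.new(url, followlocation: true)
  request.on_complete { |response| queue << response.response_body }
  hydra.queue(request)
end
hydra.run      # returns once every request has completed

queue << nil   # tell the consumer there's nothing left to pop
consumer.join

It's only a sketch, but the shape is the point: fetching and parsing overlap instead of happening one after the other.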
... how do some applications that do the same work (list the books, fetch all the data from their pages, and save it to a file), such as some manga downloaders, manage it in only 5-10 minutes for large manga sites (about 10,000 titles)?
They have parallel connections, and they process each page's data as it arrives instead of fetching and handling one page at a time.

What's my advice? It's not hard to process that many pages; you just have to be realistic about your resources and use what's available to you wisely.