I'm writing an application in Ruby that can search and fetch data from a site that has more than 10,000 pages. I use OpenURI and Nokogiri to open and parse web pages, pull the data I need out of them, and save it to a local data file:
# An example
require 'open-uri'
require 'nokogiri'

page = Nokogiri::HTML(open("http://example.com/books/title001.html"))
# Get the title, author, synopsis, etc. from that page
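For reference, the rest of the per-page work looks roughly like this; the selectors and the output file name are just placeholders, not the site's real markup:

# Continuing from the snippet above; these selectors are placeholders.
title    = page.at_css('h1.title')
author   = page.at_css('.author')
synopsis = page.at_css('.synopsis')

File.open('books.txt', 'a') do |f|
  f.puts [title, author, synopsis].map { |node| node && node.text }.join("\t")
end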
I'm on an ADSL connection, and opening a page takes about 1 second on average. Since the site has about 10,000 pages, it would take roughly 3 hours to open every page and fetch the data for all the books, which is unacceptable for this application because my users won't want to wait that long.
How can I open and parse a large number of web pages quickly and efficiently with OpenURI and Nokogiri?
If I can't do that with them, what should I do? And how do some applications that do the same work (list the books, fetch all the data from their pages, and save it to a file), such as some manga downloaders, manage it in only 5-10 minutes for large manga sites (about 10,000 titles)?
Don't use OpenURI for this; there's a much better way if you use Hydra and Typhoeus.
Like a modern code version of the mythical beast with 100 serpent heads, Typhoeus runs HTTP requests in parallel while cleanly encapsulating handling logic.
...
Parallel requests:
hydra = Typhoeus::Hydra.new
10.times.map{ hydra.queue(Typhoeus::Request.new("www.example.com", followlocation: true)) }
hydra.run
Further down in the documentation...
How to get an array of responses back after executing a queue:
hydra = Typhoeus::Hydra.new
requests = 10.times.map {
  request = Typhoeus::Request.new("www.example.com", followlocation: true)
  hydra.queue(request)
  request
}
hydra.run
responses = requests.map { |request|
  request.response.response_body
}
request.response.response_body
is the line you want to wrap with Nokogiri's parser:
Nokogiri::HTML(request.response.response_body)
At that point you'll have an array of DOMs to walk through and process.
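Adapted to your book pages, a minimal sketch might look like this; the URL pattern, the max_concurrency value, and the CSS selector are assumptions for illustration, not details of the real site:

require 'typhoeus'
require 'nokogiri'

# Hypothetical URL pattern; substitute the site's real page list.
urls = (1..10_000).map { |n| format("http://example.com/books/title%03d.html", n) }

hydra = Typhoeus::Hydra.new(max_concurrency: 20)

requests = urls.map do |url|
  request = Typhoeus::Request.new(url, followlocation: true)
  hydra.queue(request)
  request
end

hydra.run   # blocks until every queued request has finished

books = requests.map do |request|
  doc   = Nokogiri::HTML(request.response.response_body)
  title = doc.at_css('h1.title')   # placeholder selector
  title && title.text
end

Keep max_concurrency modest so you don't hammer the site, and note that this holds every response in memory before parsing, which is one reason to process them as they arrive instead.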
But wait! There's more!
Because you want to shave some processing time, you'll want to set up a Thread and Queue: push the parsed DOMs (or just the unparsed HTML response_body) onto the queue, then have the thread do the parsing and file writing. There's a rough sketch of that pipeline after the Queue docs below.
It's not hard, but it starts to put the question out of scope for Stack Overflow as it becomes a small book. Read the Thread and Queue documentation, especially the section about producers and consumers, and you should be able to piece it together. This is from the ri Queue docs:
= Queue < Object
(from ruby core)
------------------------------------------------------------------------------
This class provides a way to synchronize communication between threads.
Example:
require 'thread'

queue = Queue.new

producer = Thread.new do
  5.times do |i|
    sleep rand(i) # simulate expense
    queue << i
    puts "#{i} produced"
  end
end

consumer = Thread.new do
  5.times do |i|
    value = queue.pop
    sleep rand(i/2) # simulate expense
    puts "consumed #{value}"
  end
end
------------------------------------------------------------------------------
= Class methods:
new
= Instance methods:
<<, clear, deq, empty?, enq, length, num_waiting, pop, push, shift, size
I've used it to process large numbers of URLs in parallel and it was easy to set up and use. It's possible to do this using Threads for everything, and not use Typhoeus, but I think it's wiser to piggyback on an existing, well-written tool than to try to roll your own.
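To show how the pieces could fit together, here's a rough producer/consumer sketch in that spirit: Hydra's on_complete callbacks push each raw response body onto a Queue while a separate thread parses with Nokogiri and writes the file. The URL pattern, selector, file name, and max_concurrency value are all invented for the example.

require 'typhoeus'
require 'nokogiri'
require 'thread'

# Hypothetical URL pattern; substitute the site's real page list.
urls = (1..10_000).map { |n| format("http://example.com/books/title%03d.html", n) }

queue = Queue.new

# Consumer: parse each HTML body as it arrives and append one line per book.
consumer = Thread.new do
  File.open('books.txt', 'w') do |f|
    while (html = queue.pop)
      doc   = Nokogiri::HTML(html)
      title = doc.at_css('h1.title')   # placeholder selector
      f.puts title.text if title
    end
  end
end

# Producer: Hydra fetches the pages in parallel and pushes the raw bodies.
hydra = Typhoeus::Hydra.new(max_concurrency: 20)
urls.each do |url|
  request = Typhoeus::Request.new(url, followlocation: true)
  request.on_complete { |response| queue << response.response_body }
  hydra.queue(request)
end
hydra.run      # returns once every request has completed

queue << nil   # tell the consumer there's nothing left to pop
consumer.join

It's only a sketch, but the shape is the point: fetching and parsing overlap instead of happening one after the other.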
... how do some applications that do the same work (list the books, fetch all the data from their pages, and save it to a file), such as some manga downloaders, manage it in only 5-10 minutes for large manga sites (about 10,000 titles)?
They have parallel connections, and they process each page's data as it arrives instead of fetching and handling one page at a time.

What's my advice? It's not hard to process that many pages; you just have to be realistic about your resources and use what's available to you wisely.