Tags: ruby, database, concurrency, http-status

Best way to concurrently check many URLs in a database for HTTP status (i.e. 200, 301, 404)


Here's what I'm trying to accomplish. Let's say I have 100,000 URLs stored in a database, and I want to check each of them for its HTTP status and store that status. I want to be able to do this concurrently in a fairly small amount of time.

I was wondering what the best way(s) to do this would be. I thought about using some sort of queue with workers/consumers, or an evented model, but I don't really have enough experience to know what would work best in this scenario.

Ideas?


Solution

  • Take a look at the very capable Typhoeus and Hydra combo. The two make it very easy to concurrently process multiple URLs.

    The "Times" example should get you up and running quickly. In the on_complete block put your code to write your statuses to the DB. You could use a thread to build and maintain the queued requests at a healthy level, or queue a set number, let them all run to completion, then loop for another group. It's up to you.

    Paul Dix, the original author, talked about his design goals on his blog.

    This is some sample code I wrote to download archived mailing lists so I could do local searches. I deliberately removed the URL to keep from subjecting the site to a DoS attack if people start running the code:

    #!/usr/bin/env ruby
    
    require 'nokogiri'
    require 'addressable/uri'
    require 'typhoeus'
    
    BASE_URL = ''
    
    # Grab the index page and pull out the links to the gzipped archives.
    url = Addressable::URI.parse(BASE_URL)
    resp = Typhoeus::Request.get(url.to_s)
    doc = Nokogiri::HTML(resp.body)
    
    # Hydra runs the queued requests concurrently, ten at a time here.
    hydra = Typhoeus::Hydra.new(:max_concurrency => 10)
    doc.css('a').map{ |n| n['href'] }.select{ |href| href[/\.gz$/] }.each do |gzip|
      gzip_url = url.join(gzip)
      request = Typhoeus::Request.new(gzip_url.to_s)
    
      # Each archive is written to disk as its response completes.
      request.on_complete do |resp|
        gzip_filename = resp.request.url.split('/').last
        puts "writing #{gzip_filename}"
        File.open("gz/#{gzip_filename}", 'wb') do |fo|  # binary mode so the gzip bytes aren't mangled
          fo.write resp.body
        end
      end
      puts "queuing #{ gzip }"
      hydra.queue(request)
    end
    
    hydra.run

    Running the code on my several-year-old MacBook Pro pulled in 76 files totaling 11MB in just under 20 seconds, over wireless to DSL. If you're only doing HEAD requests your throughput will be better. You'll want to mess with the concurrency setting, because there is a point where having more concurrent sessions only slows you down and needlessly uses resources.
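
    The queue-a-batch, run, and loop approach mentioned earlier might be sketched like this; urls here is assumed to be a plain array of URL strings pulled from the database:

    require 'typhoeus'

    statuses = {}

    urls.each_slice(500) do |batch|
      hydra = Typhoeus::Hydra.new(:max_concurrency => 20)

      batch.each do |u|
        request = Typhoeus::Request.new(u, :method => :head)
        request.on_complete { |resp| statuses[u] = resp.code }
        hydra.queue(request)
      end

      # Blocks until every request in this batch has finished.
      hydra.run

      # Persist this batch's statuses before moving on to the next group.
    end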

    I give it an 8 out of 10; it's got a great beat and I can dance to it.


    EDIT:

    When checking the remote URLs you can use a HEAD request, or a GET with an If-Modified-Since header. Either can give you responses you can use to determine the freshness of your URLs.
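
    A rough sketch of the conditional-GET variant, using Typhoeus's :headers option; the URL and the stored timestamp are placeholders:

    require 'typhoeus'
    require 'time'   # for Time#httpdate

    last_checked = Time.now - 86_400   # e.g. the timestamp saved on the previous run

    request = Typhoeus::Request.new(
      'http://example.com/page',
      :method  => :get,
      :headers => { 'If-Modified-Since' => last_checked.httpdate }
    )

    request.on_complete do |resp|
      if resp.code == 304
        puts 'unchanged since last check'
      else
        puts "status #{resp.code}, #{resp.body.bytesize} bytes"
      end
    end

    hydra = Typhoeus::Hydra.new
    hydra.queue(request)
    hydra.run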