Check headers before downloading with Net::HTTP::Pipeline


I am trying to parse a list of image URLs and get some basic information about each one before I actually commit to downloading it.

  1. Is the image there? (Solved by checking response.code.)
  2. Do I already have the image? (I want to look at its type and size.)

My script will check a large list every day (about 1300 rows), and each row has 30-40 image URLs. My @photo_urls variable lets me keep track of what I have already downloaded. I would really like to use it as a hash (instead of the array in my example code) so that I can iterate through it later and do the actual downloading.

Right now my problem (besides being a Ruby newbie) is that Net::HTTP::Pipeline only accepts an array of Net::HTTPRequest objects. The documentation for net-http-pipeline indicates that response objects come back in the same order as the corresponding request objects that went in, so that order is the only thing I can use to correlate a request with its response. However, I don't know how to get the ordinal position of the current response inside the block. I assume I could just keep a counter variable, but how would I access a hash by ordinal position?

          Net::HTTP.start uri.host do |http|
            # Init HTTP requests hash
            requests = {}
            photo_urls.each do |photo_url|          
              # make sure we don't process the same image again.
              hashed = Digest::SHA1.hexdigest(photo_url)         
              next if @photo_urls.include? hashed
              @photo_urls << hashed
              # change user agent and store in hash
              my_uri = URI.parse(photo_url)
              request = Net::HTTP::Head.new(my_uri.path)
              request.initialize_http_header({"User-Agent" => "My Downloader"})
              requests[hashed] = request
            end
            # process requests (send array of values - ie. requests) in a pipeline.
            http.pipeline requests.values do |response|
              if response.code == "200"
                  # Is there any way to reference the hash key here so I can
                  # decide whether I want to do anything with it later?
              end
            end                
          end

Finally, if there is an easier way of doing this, please feel free to offer any suggestions.

Thanks!


Solution

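  • Make requests an array instead of a hash, and shift each request off the front of the array as its response comes in. Because the responses arrive in the same order the requests were sent, the request at the head of the array always belongs to the current response: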

    Net::HTTP.start uri.host do |http|
      # Init HTTP requests array
      requests = []
      photo_urls.each do |photo_url|          
        # make sure we don't process the same image again.
        hashed = Digest::SHA1.hexdigest(photo_url)         
        next if @photo_urls.include? hashed
        @photo_urls << hashed
    
        # Build the HEAD request, set the user agent, and queue it in the array.
        my_uri = URI.parse(photo_url)
        request = Net::HTTP::Head.new(my_uri.path)
        request['User-Agent'] = 'My Downloader'
        requests << request
      end
    
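      # Send a copy of the array down the pipeline so the original can be
      # shifted independently as each response arrives.
      http.pipeline requests.dup do |response|
        # Responses come back in request order, so the head of the array
        # is the request that produced this response.
        request = requests.shift
    
        if response.code == "200"
          # Do whatever checking you need with request
        end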
      end                
    end
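
  • If you would rather keep requests as a hash keyed by the SHA1 of the URL, the same ordering idea should still work, because Ruby hashes (1.9 and later) preserve insertion order: requests.keys and requests.values line up position by position. The following is only a rough sketch of that approach. FakeResponse and fake_pipeline are hypothetical stand-ins for http.pipeline so the example runs without a network connection, and the example.com URLs are made up for illustration:

```ruby
require 'net/http'
require 'digest/sha1'

# Hypothetical stand-in for a pipelined response; only #code is modeled.
FakeResponse = Struct.new(:code)

# Hypothetical stand-in for http.pipeline: yields one response per request,
# in the same order the requests were given.
def fake_pipeline(requests)
  requests.each do |request|
    code = request.path.end_with?('missing.jpg') ? '404' : '200'
    yield FakeResponse.new(code)
  end
end

# Made-up URLs, for illustration only.
photo_urls = [
  'http://example.com/photos/a.jpg',
  'http://example.com/photos/missing.jpg',
  'http://example.com/photos/b.jpg'
]

# SHA1 of the URL => HEAD request. Ruby hashes keep insertion order.
requests = {}
photo_urls.each do |photo_url|
  request = Net::HTTP::Head.new(URI.parse(photo_url).path)
  request['User-Agent'] = 'My Downloader'
  requests[Digest::SHA1.hexdigest(photo_url)] = request
end

# Keys in insertion order line up one-to-one with the incoming responses.
pending = requests.keys
to_download = {}

fake_pipeline(requests.values) do |response|
  hashed = pending.shift
  to_download[hashed] = requests[hashed] if response.code == '200'
end

# Print the paths of the requests that answered 200.
puts to_download.values.map(&:path)
```

    With the real library you would replace fake_pipeline(requests.values) with http.pipeline(requests.values) inside Net::HTTP.start. The to_download hash then holds only the entries that answered 200, ready to iterate over when doing the actual downloads.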