ruby web-crawler anemone

HTTP Basic Authentication with Anemone Web Spider


I need to collect the "title" of every page on a site.
The site is protected with HTTP Basic Auth.
Without auth, I do the following:

require 'anemone'
Anemone.crawl("http://example.com/") do |anemone|
  anemone.on_every_page do |page|
    puts page.doc.at('title').inner_html rescue nil
  end
end

But I have a problem with HTTP Basic Auth...
How can I collect titles from a site that uses HTTP Basic Auth?
If I try Anemone.crawl("http://username:password@example.com/"), I get only the first page's title: the other links are in the plain http://example.com/ form, so those requests receive a 401 error.


Solution

  • HTTP Basic Auth works via HTTP headers. A client that wants to access a restricted resource must send an authentication header like this one:

    Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
    

    It contains the username and password, joined by a colon and Base64-encoded. More info is in the Wikipedia article: Basic Access Authentication.
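
    As an illustration of how such a header value is put together, here is a minimal sketch in plain Ruby. The helper name basic_auth_header is hypothetical, and the credentials are the well-known "Aladdin" / "open sesame" example pair, which is what the header above encodes.

    ```ruby
    # Hypothetical helper: builds the value of the Authorization header
    # for HTTP Basic Auth from a username and a password.
    def basic_auth_header(username, password)
      # The credentials are joined with a colon, then Base64-encoded.
      # 'm0' is Array#pack's Base64 directive without line breaks.
      # Note that this is an encoding, not encryption.
      'Basic ' + ["#{username}:#{password}"].pack('m0')
    end

    puts basic_auth_header('Aladdin', 'open sesame')
    # prints "Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=="
    ```

    Since anyone who sees the header can decode it, Basic Auth is only reasonably safe over HTTPS.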

    I googled a bit and didn't find a way to make Anemone accept custom request headers. Maybe you'll have more luck.

    But I found another crawler that claims it can do this: Messie. Maybe you should give it a try.

    Update

    Here's the place where Anemone sets its request headers: Anemone::HTTP. Indeed, there's no customization there, but you can monkeypatch it. Something like this should work (put it somewhere in your app, after requiring anemone):

    module Anemone
      class HTTP
        def get_response(url, referer = nil)
          full_path = url.query.nil? ? url.path : "#{url.path}?#{url.query}"
    
          opts = {}
          opts['User-Agent'] = user_agent if user_agent
          opts['Referer'] = referer.to_s if referer
          opts['Cookie'] = @cookie_store.to_s unless @cookie_store.empty? || (!accept_cookies? && @opts[:cookies].nil?)
    
          retries = 0
          begin
            start = Time.now
            # format request
            req = Net::HTTP::Get.new(full_path, opts)
            # HTTP Basic authentication: must be set before the request is sent
            req.basic_auth 'your username', 'your password' # <<== tweak here
            response = connection(url).request(req)
            finish = Time.now
            response_time = ((finish - start) * 1000).round
            @cookie_store.merge!(response['Set-Cookie']) if accept_cookies?
            return response, response_time
          rescue Timeout::Error, Net::HTTPBadResponse, EOFError => e
            puts e.inspect if verbose?
            refresh_connection(url)
            retries += 1
            retry unless retries > 3
          end
        end
      end
    end
    

    Obviously, you should pass your own username and password to the basic_auth call. It's quick, dirty, and hardcoded, yes. But sometimes you don't have time to do things properly. :)
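
    If the hardcoding is a concern, one possible variation is to read the credentials from the environment and attach them while the request is being built. The sketch below uses only Net::HTTP from the standard library, outside of Anemone. The helper name build_authenticated_get and the variable names CRAWL_USER and CRAWL_PASSWORD are assumptions made for illustration, not part of Anemone.

    ```ruby
    require 'net/http'
    require 'uri'

    # Hypothetical helper: builds a GET request that already carries
    # the Basic Auth header. Nothing is sent over the network here.
    def build_authenticated_get(url, username, password)
      uri = URI(url)
      # request_uri is the path plus the query string, if there is one.
      req = Net::HTTP::Get.new(uri.request_uri)
      # basic_auth sets the Authorization header on the request object,
      # so it has to be called before the request is sent.
      req.basic_auth(username, password)
      req
    end

    # CRAWL_USER and CRAWL_PASSWORD are assumed variable names;
    # the fallback values are placeholders only.
    user = ENV.fetch('CRAWL_USER', 'your username')
    password = ENV.fetch('CRAWL_PASSWORD', 'your password')

    req = build_authenticated_get('http://example.com/page?x=1', user, password)
    puts req.path
    puts req['Authorization'].start_with?('Basic ')
    ```

    The same two ENV.fetch calls could replace the literal strings passed to basic_auth in the monkeypatch above.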