ruby, web-crawler, anemone

Ruby Anemone spider adding a tag to each url visited


I have a crawl set up:

require 'anemone'

Anemone.crawl("http://www.website.co.uk", :depth_limit => 1) do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end

However, I want the spider to add a Google Analytics anti-tracking tag to every URL it visits, without necessarily clicking the links.

I could run the spider once, store all of the URLs, and then use Watir to run through them adding the tag, but I want to avoid this because it is slow and I like the skip_links_like and page depth functions.

How could I implement this?


Solution

  • You want to add something to the URL before you load it, correct? You can use focus_crawl for that.

    Anemone.crawl("http://www.website.co.uk", :depth_limit => 1) do |anemone|
        anemone.focus_crawl do |page|
            page.links.map do |url|
                # url will be a URI (probably URI::HTTP) so adjust
                # url.query as needed here and then return url from
                # the block.
                url
            end
        end
        anemone.on_every_page do |page|
            puts page.url
        end
    end
    

    The focus_crawl method is intended to filter the URL list:

    Specify a block which will select which links to follow on each page. The block should return an Array of URI objects.

    but because the crawler follows whatever URIs the block returns, you can use it as a general-purpose URL rewriter as well.

    For example, if you wanted to add atm_source=SiteCon&atm_medium=Mycampaign to all the links then your page.links.map would look something like this:

    page.links.map do |uri|
        # Grab the query string, break it into components, throw out
        # any existing atm_source or atm_medium components. The to_s
        # does nothing if there is a query string but turns a nil into
        # an empty string to avoid some conditional logic.
        q = uri.query.to_s.split('&').reject { |x| x =~ /^atm_(source|medium)=/ }
    
        # Add the atm_source and atm_medium that you want.
        q << 'atm_source=SiteCon' << 'atm_medium=Mycampaign'
    
        # Rebuild the query string 
        uri.query = q.join('&')
    
        # And return the updated URI from the block
        uri
    end
    

    If your atm_source or atm_medium values contain non-URL-safe characters, then URI-encode them.
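
One way to handle the encoding is to let Ruby's standard `uri` library build the query string instead of joining the pieces by hand. The following is a minimal sketch, not part of Anemone: `with_query_params` is a hypothetical helper name, and the parameter values are purely illustrative. It assumes the existing query string is ordinary `key=value` form data. Note that decoding and re-encoding the query may normalize escapes that were already present (for example, `%20` comes back as `+`).

```ruby
require 'uri'

# Hypothetical helper (not an Anemone method): returns a copy of +uri+ whose
# query string carries the given parameters. Existing values for the same
# keys are replaced, and keys and values are percent-encoded.
def with_query_params(uri, params)
  tagged = uri.dup

  # decode_www_form turns "a=1&b=2" into [["a", "1"], ["b", "2"]];
  # to_s turns a nil query into an empty string, which decodes to [].
  pairs = URI.decode_www_form(tagged.query.to_s)

  # Throw out any existing values for the keys we are about to set.
  pairs.reject! { |key, _value| params.key?(key) }

  # encode_www_form escapes unsafe characters and joins the pairs with "&".
  tagged.query = URI.encode_www_form(pairs + params.to_a)
  tagged
end

uri = URI.parse('http://www.website.co.uk/page?id=7&atm_source=old')
puts with_query_params(uri, 'atm_source' => 'Site Con', 'atm_medium' => 'My&campaign')
```

Inside the focus_crawl block, the body of page.links.map would then reduce to a single call such as `with_query_params(uri, 'atm_source' => 'SiteCon', 'atm_medium' => 'Mycampaign')`.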