ruby, mongodb, memory-leaks, web-crawler, anemone

Ruby, Mongodb, Anemone: web crawler with possible memory leak?


I recently started learning about web crawlers and built a sample crawler in Ruby with Anemone, using MongoDB for storage. I'm testing the crawler on a massive public website that may have billions of links.

The crawler.rb script indexes the correct information, but when I check memory usage in Activity Monitor it shows memory constantly growing. I've only run the crawler for about 6-7 hours, and memory is already at 1.38 GB for mongod and 1.37 GB for the Ruby process; it seems to grow by about 100 MB every hour or so.

Do I have a memory leak? Is there a more memory-efficient way to achieve the same crawl, so that it can run longer without memory escalating out of control?
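
One way to watch the trend from inside the crawl itself, rather than eyeballing Activity Monitor, would be to log the process's resident set size every few hundred pages. This is just a rough sketch (the ps call and the 500-page interval are arbitrary choices of mine), not part of the crawler below:

require 'anemone'

pages_seen = 0

Anemone.crawl("http://www.example.com/", :discard_page_bodies => true) do |anemone|
  anemone.on_every_page do |page|
    pages_seen += 1
    next unless pages_seen % 500 == 0

    # Resident set size (KB) of this Ruby process; works on macOS and Linux.
    rss_kb = `ps -o rss= -p #{Process.pid}`.to_i
    puts "pages: #{pages_seen}, RSS: #{rss_kb / 1024} MB"
  end
end

And here is the crawler itself: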

# Sample web_crawler.rb with Anemone, MongoDB and Ruby.

require 'anemone'

# Patch Anemone::Page so the page body is never included when a page is serialized to storage.
module Anemone
  class Page
    def to_hash
      {'url' => @url.to_s,
       'links' => links.map(&:to_s),
       'code' => @code,
       'visited' => @visited,
       'depth' => @depth,
       'referer' => @referer.to_s,
       'fetched' => @fetched}
    end
    def self.from_hash(hash)
      page = self.new(URI(hash['url']))
      {'@links' => hash['links'].map { |link| URI(link) },
       '@code' => hash['code'].to_i,
       '@visited' => hash['visited'],
       '@depth' => hash['depth'].to_i,
       '@referer' => hash['referer'],
       '@fetched' => hash['fetched']
      }.each do |var, value|
        page.instance_variable_set(var, value)
      end
      page
    end
  end
end


Anemone.crawl("http://www.example.com/", :discard_page_bodies => true, :threads => 1, :obey_robots_txt => true, :user_agent => "Example - Web Crawler", :large_scale_crawl => true) do | anemone |
  anemone.storage = Anemone::Storage.MongoDB

  # only follow links whose URL contains /example
  anemone.focus_crawl do |page|
    links = page.links.delete_if do |link|
      (link.to_s =~ /example/).nil?
    end
  end

  # only process pages in the /example directory
  anemone.on_pages_like(/example/) do | page |
    regex = /some type of regex/
    example = page.doc.css('#example_div').inner_html.gsub(regex,'') rescue next

    # Save to text file
    unless example.nil? || example.empty?
      open('example.txt', 'a') { |f| f.puts example }
    end
    page.discard_doc!
  end
end

Solution

  • I am also having this problem, but I am using Redis as the datastore.

    This is my crawler:

    require "rubygems"
    
    require "anemone"
    
    urls = File.open("urls.csv")
    opts = {discard_page_bodies: true, skip_query_strings: true, depth_limit:2000, read_timeout: 10} 
    
    File.open("results.csv", "a") do |result_file|
    
      while row = urls.gets
    
        row_ = row.strip.split(',')
        if row_[1].start_with?("http://")
          url = row_[1]
        else
          url = "http://#{row_[1]}"
        end 
        Anemone.crawl(url, options = opts) do |anemone|
          anemone.storage = Anemone::Storage.Redis
          puts "crawling #{url}"    
          anemone.on_every_page do |page| 
    
            next if page.body == nil 
    
            if page.body.downcase.include?("sometext")
              puts "found one at #{url}"     
              result_file.puts "#{row_[0]},#{row_[1]}"
              next
    
            end # end if 
    
          end # end on_every_page
    
        end # end crawl
    
      end # end while
    
      # we're done
      puts "We're done."
    
    end # end File.open
    

    I applied the patch from here to my core.rb file in the anemone gem:

    35       # Prevent page_queue from using excessive RAM. Can indirectly limit rate of crawling. You'll additionally want to use discard_page_bodies and/or a non-memory 'storage' option
    36       :max_page_queue_size => 100,
    

    ...

    (The following used to be on line 155)

    157       page_queue = SizedQueue.new(@opts[:max_page_queue_size])
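
    In case it's useful, here's a tiny standalone sketch of mine (not code from the gem) showing what that SizedQueue does: once the queue holds :max_page_queue_size items, push blocks until something pops, which is what keeps the page queue from eating RAM:

    require 'thread'

    # A queue capped at 3 items stands in for Anemone's page_queue.
    queue = SizedQueue.new(3)

    producer = Thread.new do
      10.times do |i|
        queue.push(i)              # blocks whenever the queue is already full
        puts "queued #{i}"
      end
      queue.push(:done)
    end

    consumer = Thread.new do
      loop do
        item = queue.pop
        break if item == :done
        sleep 0.1                  # simulate slow page processing
      end
    end

    [producer, consumer].each(&:join)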
    

    I also have an hourly cron job doing:

    #!/usr/bin/env python
    import redis
    r = redis.Redis()
    r.flushall()
    

    to try to keep Redis's memory usage down. I'm restarting a giant crawl now, so we'll see how it goes!
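
    The crontab entry for it looks roughly like this (the script path here is just a placeholder, not my real one):

    # hypothetical location of the flushall script above
    0 * * * * /usr/bin/env python /path/to/flush_redis.py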

    I'll report back with results...