web-scraping, web-crawler, nutch

Does any open, easily extensible web crawler exist?


I am looking for a web crawler solution that is mature enough and can be easily extended. I am interested in the following features, or in the possibility of extending the crawler to meet them:

  • partly, just reading the feeds of several sites
  • scraping the content of these sites
  • if a site has an archive, I would like to crawl and index it as well
  • the crawler should be able to explore part of the Web for me and decide which sites match the given criteria
  • it should be able to notify me if it finds things that possibly match my interests
  • the crawler should not overwhelm servers with too many requests; it should crawl politely
  • the crawler should be robust against broken or misbehaving sites and servers

Each of those things can be done individually without much effort, but I am interested in a solution that provides a customisable, extensible crawler. I have heard of Apache Nutch, but I am very unsure about the project so far. Do you have experience with it? Can you recommend alternatives?


Solution

  • A quick search on GitHub turned up Anemone, a web spider framework which seems to fit your requirements, particularly extensibility. It is written in Ruby.
    Hope it goes well!
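
    To give a feel for Anemone's style, here is a minimal sketch of a polite, focused crawl. The URL is a placeholder, and the option names (:delay, :obey_robots_txt, :depth_limit) and the /archive/ link filter are assumptions to be checked against Anemone's README for your version:

    ```ruby
    require 'anemone'

    # Minimal sketch of a polite crawl (example.com is a placeholder URL).
    Anemone.crawl("http://www.example.com/",
                  :delay => 2,              # pause between requests so we don't hammer the server
                  :obey_robots_txt => true, # respect robots.txt
                  :depth_limit => 3) do |anemone|

      # Only follow links that look like archive pages (hypothetical pattern).
      anemone.focus_crawl do |page|
        page.links.select { |uri| uri.to_s =~ /archive/ }
      end

      # Runs on every fetched page; page.doc is a Nokogiri document you can scrape.
      anemone.on_every_page do |page|
        title = page.doc.at('title').text rescue nil
        puts "#{page.url} #{title}"
      end
    end
    ```

    Feed reading and notifications would sit outside the crawler itself (e.g. a separate feed parser plus whatever alerting you already use), but hooks like on_every_page are where you would plug in your "does this match my criteria?" logic.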