
How to run Nutch in a production environment


I have been experimenting with a few crawl cycles in Nutch and would like to set up a distributed crawl environment. How can I trigger Nutch for incoming crawl requests in a production system? I read about the Nutch REST API. Is that the only real option I have, or is there another way to run Nutch as a continuously running distributed server?

My preferred Nutch version is 1.12.


Solution

  • As sujen stated, there are two options:

    1. Use the REST API if you want to submit crawl requests to Nutch remotely. The steps to get this running are described here:

    How to run nutch server on distributed environment

    2. Otherwise, run the bin/crawl script from runtime/deploy to launch crawl jobs in distributed mode on Hadoop.
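To make the two options concrete, here is a rough Python sketch that only builds the command line and the HTTP request without running anything. Several details are assumptions rather than facts from this thread: the install path /opt/nutch, the server address http://localhost:8081 (reached after starting the server with bin/nutch startserver), the bin/crawl argument order <seed_dir> <crawl_dir> <num_rounds>, and the /job/create payload fields, which follow my reading of the Nutch 1.x REST API docs. Check them against the usage message and docs of your own Nutch version before relying on them.

```python
import json
import urllib.request


def build_crawl_command(seed_dir, crawl_dir, rounds, nutch_home="/opt/nutch"):
    """Return the argv list for bin/crawl in runtime/deploy (not executed here)."""
    # nutch_home is a hypothetical install location.
    script = f"{nutch_home}/runtime/deploy/bin/crawl"
    # Assumed argument order for Nutch 1.12: <seed_dir> <crawl_dir> <num_rounds>.
    # In deploy mode the two directories are paths on HDFS.
    return [script, seed_dir, crawl_dir, str(rounds)]


def build_job_request(job_type, crawl_id, args, server="http://localhost:8081"):
    """Build (but do not send) a POST /job/create request for the Nutch server."""
    # Payload field names are an assumption based on the Nutch 1.x REST API docs.
    payload = {
        "type": job_type,
        "confId": "default",
        "crawlId": crawl_id,
        "args": args,
    }
    return urllib.request.Request(
        url=f"{server}/job/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Option 2: command to hand to subprocess.run(cmd, check=True)
    cmd = build_crawl_command("urls", "crawl", 2)
    print(" ".join(cmd))

    # Option 1: request to hand to urllib.request.urlopen(req)
    req = build_job_request("INJECT", "crawl01", {"url_dir": "urls"})
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

With the REST route, each step of a crawl cycle (inject, generate, fetch, parse, updatedb) is submitted as its own job, so a scheduler or queue consumer would call build_job_request once per step. With bin/crawl, the script runs the whole cycle for the given number of rounds, which is simpler when a cron job or workflow tool triggers the crawls.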