nutch emr

Nutch numSlaves parameter in crawl script


I am using Nutch 1.9 to crawl a set of 500 websites. I am running Nutch on an Amazon EMR cluster and indexing the data into Solr.

I started the EMR cluster with 5 slave nodes and set the numSlaves parameter to 5 in the crawl script. I would like to increase the number of slaves to 10 to speed up the process. I am able to increase the number of slave nodes to 10 in the AWS console. Will Nutch use all 10 slave nodes without my restarting the crawl or modifying the crawl script?

Thanks


Solution

  • Nope. You'll need to modify the crawl script and restart it. No big deal though: just SSH to the master node and create a file named .STOP in runtime/deploy/bin. This stops the crawl loop once the current iteration is complete. You can then set numSlaves to 10 and restart the script.

    BTW, you'd get quicker answers by asking on the Nutch mailing lists.
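
    The steps above could look roughly like the sketch below. It is not taken from the Nutch documentation and makes several assumptions: NUTCH_BIN is a hypothetical variable that you would point at runtime/deploy/bin on the master node, the crawl script is assumed to set the value on a line starting with numSlaves=, and a leftover .STOP file is assumed to stop a restarted loop. When NUTCH_BIN is not set, the snippet works on a scratch directory with a stand-in crawl file, so it can be tried without touching a real installation.

```shell
#!/bin/sh
# Sketch only. NUTCH_BIN is an assumed variable; on a real master node it
# would point at runtime/deploy/bin. Default: a scratch directory.
NUTCH_BIN="${NUTCH_BIN:-$(mktemp -d)}"

# Stand-in for the real crawl script when none exists (demo only).
[ -f "$NUTCH_BIN/crawl" ] || printf 'numSlaves=5\n' > "$NUTCH_BIN/crawl"

# 1. Ask the running crawl loop to exit after the current iteration.
touch "$NUTCH_BIN/.STOP"

# 2. After the loop has exited, change the setting.
#    Assumes a line of the form "numSlaves=<n>"; a backup is kept as crawl.bak.
sed -i.bak 's/^numSlaves=.*/numSlaves=10/' "$NUTCH_BIN/crawl"

# 3. Remove the marker so the restarted script is not stopped again
#    (assumption: the loop checks for .STOP on every iteration).
rm -f "$NUTCH_BIN/.STOP"

# Show the value that the next run will use.
grep '^numSlaves=' "$NUTCH_BIN/crawl"
```

    On a real cluster, wait for the current iteration to finish between steps 1 and 2, then restart the crawl script with the same arguments you used originally.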