I'm running Nutch on Elastic MapReduce with 3 worker nodes. I'm using Nutch 1.4 with the default configuration it ships with (after adding a user agent).
However, even though I'm crawling a list of 30,000 domains, the fetch step runs on only one worker node, while the parse step runs on all three.
How do I get the fetch step to run on all three nodes?
*EDIT* The problem was that I needed to set the mapred.map.tasks property to the size of my Hadoop cluster. You can find this documented here.
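For reference, this is roughly what the override looks like in conf/nutch-site.xml (the value of 3 here assumes a 3-node cluster like mine; adjust it to your own cluster size):

<property>
<name>mapred.map.tasks</name>
<value>3</value>
<description>The number of map tasks per job. Set to the number of
worker nodes so the fetch step is split across the cluster.
</description>
</property>

The same property can also be passed on the command line with -D mapred.map.tasks=3 instead of editing the config file.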
By default Nutch partitions URLs based on their host. The corresponding property in nutch-default.xml is:
<property>
<name>partition.url.mode</name>
<value>byHost</value>
<description>Determines how to partition URLs. Default value is 'byHost',
also takes 'byDomain' or 'byIP'.
</description>
</property>
Please verify the value in your setup.
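If you do need to change it, override it in conf/nutch-site.xml rather than editing nutch-default.xml. For example, to partition by domain instead (a sketch, using the 'byDomain' value documented in the property above):

<property>
<name>partition.url.mode</name>
<value>byDomain</value>
<description>Partition URLs by domain instead of host, so many hosts
under one domain still spread across fetcher map tasks.
</description>
</property>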
I think your problem can be diagnosed by getting answers to these questions: