hadoop, nutch, elastic-map-reduce

Why does Nutch only run the fetch step on one Hadoop node, when the cluster has 5 nodes total?


I'm running Nutch on Elastic MapReduce with 3 worker nodes. I'm using Nutch 1.4 with the default configuration it ships with (after adding a user agent).

However, even though I'm crawling a list of 30,000 domains, the fetch step runs on only one worker node, while the parsing step runs on all three.

How do I get it to run the fetch step from all three nodes?

*EDIT* The problem was that I needed to set the mapred.map.tasks property to the size of my Hadoop cluster. You can find this documented here


Solution

  • By default, Nutch partitions URLs by host. The corresponding property in nutch-default.xml is:

    <property>
      <name>partition.url.mode</name>
      <value>byHost</value>
      <description>Determines how to partition URLs. Default value is 'byHost', 
      also takes 'byDomain' or 'byIP'. 
      </description>
    </property>
    

    Please verify the value on your setup.

    I think your problem can be diagnosed by answering these questions:

    1. How many mappers were created for the fetch job? Multiple mappers may have been spawned, with all but one finishing early.
    2. What topN value was used in the generate command? If it is low, then despite having 30K pages, very few will be sent to the fetch phase.
    3. Did you use the numFetchers option in the generate command? It controls the number of maps created for the fetch job.
    4. How many reducers were created for the generate-partition job? If this value is 1, only a single map will be created in the fetch phase. The output of the generate-partition job is the input to the fetch phase, so the number of part files created by generate (i.e. the number of its reducers) equals the number of maps created for the fetch job.
    5. What is the setting for mapred.map.tasks on your Hadoop cluster? What is the corresponding value for reduce tasks?
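
    For question 5, and for the fix described in the question's edit, an override of mapred.map.tasks might look like the fragment below. This is only a sketch: the value 3 assumes the three worker nodes mentioned in the question, and which file it belongs in (for example conf/nutch-site.xml or Hadoop's mapred-site.xml) is an assumption that depends on your setup. Hadoop treats this property as a hint rather than a hard limit.

    ```xml
    <!-- Hypothetical override; the value 3 assumes three worker nodes. -->
    <property>
      <name>mapred.map.tasks</name>
      <value>3</value>
      <description>Default number of map tasks per job (a hint to Hadoop).
      </description>
    </property>
    ```

    After changing it, rerun the generate step and check how many part files it produces, since that count determines the number of fetch maps.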