
Nutch Hadoop MapReduce Java heap space OutOfMemory


I am running a Nutch 1.16, Hadoop 2.8.3, Solr 8.5.1 crawler setup that runs fine up to a few million indexed pages. Then I run into Java heap space issues during the MapReduce job, and I just cannot seem to find the right way to increase that heap space. I have tried:

  1. Passing -D mapreduce.map.memory.mb=24608 -D mapreduce.map.java.opts=-Xmx24096m when starting the Nutch crawl.
  2. Editing NUTCH_HOME/bin/crawl commonOptions mapred.child.java.opts to -Xmx16000m
  3. Setting HADOOP_HOME/etc/hadoop/mapred-site.xml mapred.child.java.opts to -Xmx160000m -XX:+UseConcMarkSweepGC
  4. Copying said mapred-site.xml into my nutch/conf folder

None of that seems to change anything. I run into the same heap space error at the same point in the crawling process. I have tried reducing the fetcher threads from 25 back to 12 and switching off parsing while fetching. Nothing changed, and I am out of ideas. I have 64 GB of RAM, so that's really not the issue. Please help ;)

EDIT: fixed filename to mapred-site.xml


Solution

    1. Passing -D ...

    The heap space needs to be set for the reduce task as well, using "mapreduce.reduce.memory.mb" and "mapreduce.reduce.java.opts". Note that the script bin/crawl was recently improved in this regard, see NUTCH-2501 and the recent bin/crawl script. A sketch of such an invocation is given below.

    3./4. Setting/copying hadoop-site.xml

    Shouldn't this be set in "mapred-site.xml"? A minimal example of the relevant properties in that file is also sketched below.
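
    For point 1, a rough sketch of how the reduce-side options could be passed alongside the map-side ones when invoking the crawl script. The seed directory, crawl directory, number of rounds and heap sizes here are placeholders, not recommendations:

        # illustrative invocation -- adjust paths, rounds and sizes to your setup
        bin/crawl -i -s urls/ \
            -D mapreduce.map.memory.mb=8192 \
            -D mapreduce.map.java.opts=-Xmx6g \
            -D mapreduce.reduce.memory.mb=8192 \
            -D mapreduce.reduce.java.opts=-Xmx6g \
            crawl/ 10

    As a rule of thumb, the -Xmx value in the *.java.opts settings is usually kept somewhat below the container size in *.memory.mb to leave room for non-heap overhead.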
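
    For points 3./4., if the properties are set in a mapred-site.xml (whether in HADOOP_HOME/etc/hadoop or copied into nutch/conf), a minimal sketch could look like this, again with illustrative values:

        <configuration>
          <!-- container size requested for map tasks (MB) -->
          <property>
            <name>mapreduce.map.memory.mb</name>
            <value>8192</value>
          </property>
          <!-- JVM heap for map tasks, kept below the container size -->
          <property>
            <name>mapreduce.map.java.opts</name>
            <value>-Xmx6g</value>
          </property>
          <!-- the same pair for reduce tasks, which also need a larger heap here -->
          <property>
            <name>mapreduce.reduce.memory.mb</name>
            <value>8192</value>
          </property>
          <property>
            <name>mapreduce.reduce.java.opts</name>
            <value>-Xmx6g</value>
          </property>
        </configuration>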