I am running a Nutch 1.16, Hadoop 2.8.3, Solr 8.5.1 crawler setup that runs fine up to a few million indexed pages. Then I run into Java Heap Space issues during the MapReduce job, and I cannot seem to find the correct way to increase that heap space. I have tried:
1. Passing -D mapreduce.map.memory.mb=24608 -D mapreduce.map.java.opts=-Xmx24096m when starting the Nutch crawl.
2. Setting mapred.child.java.opts to -Xmx16000m in the bin/crawl script's common options.
3. Setting mapred.child.java.opts to -Xmx160000m -XX:+UseConcMarkSweepGC in hadoop-site.xml (roughly as sketched after this list).
4. Copying that hadoop-site.xml into my nutch/conf folder.
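For clarity, the edit in steps 3/4 amounts to a property block like this (a sketch; mapred.child.java.opts is the legacy catch-all heap option, so its exact placement in my file may differ):

    <!-- sketch of steps 3/4: legacy catch-all heap option, applied to both
         map and reduce child JVMs unless per-task options override it -->
    <configuration>
      <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx160000m -XX:+UseConcMarkSweepGC</value>
      </property>
    </configuration>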
None of that seems to change anything: I run into the same Heap Space error at the same point in the crawling process. I have also tried reducing the fetcher threads from 25 back to 12 and switching off parsing while fetching. Nothing changed, and I am out of ideas. I have 64 GB of RAM, so that is really not the issue. Please help ;)
EDIT: fixed filename to mapred-site.xml
- Passing -D ...
The heap space also needs to be set for the reduce task, using "mapreduce.reduce.memory.mb" and "mapreduce.reduce.java.opts". Note that the script bin/crawl was recently improved in this regard; see NUTCH-2501 and the current bin/crawl script.
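For example, a minimal sketch of such an invocation, assuming the stock bin/crawl options (-i, -D, -s) and using illustrative memory values; the container size (*.memory.mb) should stay somewhat above the JVM heap (-Xmx):

    # sketch: set heap for both map AND reduce tasks when launching the crawl
    bin/crawl -i \
      -D mapreduce.map.memory.mb=8192 \
      -D mapreduce.map.java.opts=-Xmx6144m \
      -D mapreduce.reduce.memory.mb=8192 \
      -D mapreduce.reduce.java.opts=-Xmx6144m \
      -s urls crawl 5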
3./4. Setting/copying hadoop-site.xml
Shouldn't this be set in "mapred-site.xml"?
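If the config-file route is preferred, a minimal mapred-site.xml sketch using the current (Hadoop 2.x) per-task property names, values again illustrative:

    <configuration>
      <!-- container sizes requested from YARN -->
      <property>
        <name>mapreduce.map.memory.mb</name>
        <value>8192</value>
      </property>
      <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>8192</value>
      </property>
      <!-- JVM heap inside each container: keep -Xmx below *.memory.mb -->
      <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx6144m</value>
      </property>
      <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx6144m</value>
      </property>
    </configuration>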