
Nutch Hadoop MapReduce Java heap space OutOfMemory


I am running a Nutch 1.16, Hadoop 2.8.3, Solr 8.5.1 crawler setup that runs fine up to a few million indexed pages. Then I run into Java heap space issues during the MapReduce job, and I just cannot seem to find the right way to increase that heap space. I have tried:

  1. Passing -D mapreduce.map.memory.mb=24608 -D mapreduce.map.java.opts=-Xmx24096m when starting the Nutch crawl.
  2. Editing NUTCH_HOME/bin/crawl commonOptions mapred.child.java.opts to -Xmx16000m
  3. Setting HADOOP_HOME/etc/hadoop/mapred-site.xml mapred.child.java.opts to -Xmx160000m -XX:+UseConcMarkSweepGC
  4. Copying said mapred-site.xml into my nutch/conf folder

None of that seems to change anything. I run into the same heap space error at the same point in the crawling process. I have tried reducing the fetcher threads from 25 back to 12 and switching off parsing while fetching. Nothing changed, and I am out of ideas. I have 64 GB of RAM, so that's really not the issue. Please help ;)

EDIT: fixed filename to mapred-site.xml


Solution

    1. Passing -D ...

    The heap space needs to be set for the reduce task as well, using "mapreduce.reduce.memory.mb" and "mapreduce.reduce.java.opts". Note that the script bin/crawl was recently improved in this regard, see NUTCH-2501 and the recent bin/crawl script. A sketch of such an invocation is given below.

    3./4. Setting/copying hadoop-site.xml

    Shouldn't this be set in "mapred-site.xml"? A minimal example of the relevant properties in that file is also sketched below.
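
    For point 1, a rough sketch of how the reduce-side options could be passed alongside the map-side ones when invoking the crawl script. The seed directory, crawl directory, number of rounds and heap sizes here are placeholders, not recommendations:

        # illustrative invocation -- adjust paths, rounds and sizes to your setup
        bin/crawl -i -s urls/ \
            -D mapreduce.map.memory.mb=8192 \
            -D mapreduce.map.java.opts=-Xmx6g \
            -D mapreduce.reduce.memory.mb=8192 \
            -D mapreduce.reduce.java.opts=-Xmx6g \
            crawl/ 10

    As a rule of thumb, the -Xmx value in the *.java.opts settings is usually kept somewhat below the container size in *.memory.mb to leave room for non-heap overhead.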
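
    For points 3./4., if the properties are set in a mapred-site.xml (whether in HADOOP_HOME/etc/hadoop or copied into nutch/conf), a minimal sketch could look like this, again with illustrative values:

        <configuration>
          <!-- container size requested for map tasks (MB) -->
          <property>
            <name>mapreduce.map.memory.mb</name>
            <value>8192</value>
          </property>
          <!-- JVM heap for map tasks, kept below the container size -->
          <property>
            <name>mapreduce.map.java.opts</name>
            <value>-Xmx6g</value>
          </property>
          <!-- the same pair for reduce tasks, which also need a larger heap here -->
          <property>
            <name>mapreduce.reduce.memory.mb</name>
            <value>8192</value>
          </property>
          <property>
            <name>mapreduce.reduce.java.opts</name>
            <value>-Xmx6g</value>
          </property>
        </configuration>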