I am currently working on a project where I need an in-memory structure for my map task. I have done some calculations and I can say that I don't need more than 600MB of memory per map task. But the thing is that after a while I run into Java heap space problems or a GC overhead limit error. I don't know how this is possible.
Here are some more details. I have two quad-core CPUs and 12GB of RAM, so I can have up to 8 map tasks running at the same time. I am building a tree, so I have an iterative algorithm that runs a map-reduce job for every tree level. My algorithm works fine for small datasets, but for a medium-sized dataset it fails: it reaches a certain tree level and then runs out of heap space or hits the GC overhead limit. At that point I did some calculations and saw that each task needs no more than 100MB of memory, so 8 tasks should use about 800MB. I don't know what is going on. I even updated my hadoop-env.sh file with these lines:
export HADOOP_HEAPSIZE=8000
export HADOOP_OPTS=-XX:+UseParallelGC
What is the problem? Do these lines even override the Java options for my system? Using the parallel GC is something I saw recommended online for machines with multiple cores.
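To double-check whether these options even reach the task JVMs, I am thinking of logging the heap limit from inside my mapper, roughly like this (TreeLevelMapper and the key/value types are just placeholders for my real classes; the output should show up in the task's stderr log):
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TreeLevelMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void setup(Context context) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() reports the -Xmx ceiling this task JVM was actually started with.
        System.err.println("task max heap: " + rt.maxMemory() / (1024 * 1024) + " MB");
        // totalMemory() - freeMemory() is the heap currently in use.
        System.err.println("task used heap: "
                + (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024) + " MB");
    }
}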
Edit:
OK, here are some updates after monitoring heap space and total memory. I consume about 3500MB of RAM when running 6 tasks at the same time. That means the jobtracker, tasktracker, namenode, datanode, secondary namenode, my operating system and the 6 tasks together use about 3500MB of RAM, which is a very reasonable amount. So why do I get a GC overhead limit error? I follow the same algorithm for every tree level; the only thing that changes is the number of nodes per level, and having many nodes in a level does not add much overhead to my algorithm. So why can't the GC cope?
If your maximum heap size hasn't been changed, it will default to 1/4 of main memory, i.e. about 3 GB, plus some overhead for non-heap usage, which could come to about 3.5 GB.
I suggest you try
export HADOOP_OPTS="-XX:+UseParallelGC -Xmx8g"
to set the maximum memory to 8 GB.
By default the maximum heap size is 1/4 of main memory (unless you are running a 32-bit JVM on Windows). So if your maximum heap size setting is being ignored, the heap will still be about 3 GB.
Whether you use one GC or another won't make much difference to when you run out of memory.
I suggest you take a heap dump with -XX:+HeapDumpOnOutOfMemoryError
and read it in a profiler, e.g. VisualVM, to see why it's using so much memory.
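For example, something along these lines in hadoop-env.sh (the dump path is just an example; point it at any directory with enough free space, as the .hprof file can be as large as the heap):
export HADOOP_OPTS="-XX:+UseParallelGC -Xmx8g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hadoop-dumps"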