java, hadoop, garbage-collection, heap-memory, multicore

Hadoop: Heap space and gc problems


I am currently working on a project where I need an in-memory structure for my map task. I have made some calculations and I can say that I don't need more than 600 MB of memory for each map task. But the thing is that after a while I run into Java heap space errors or a GC overhead limit exceeded error. I don't know how this can be possible.

Here are some more details. I have a system with two quad-core CPUs and 12 GB of RAM, which means I can have up to 8 map tasks running at the same time. I am building a tree, so I have an iterative algorithm that runs a map-reduce job for every tree level. The algorithm works fine for small datasets, but for a medium-sized dataset it runs into heap space problems: it reaches a certain tree level and then runs out of heap space or hits the GC overhead limit. At that point I did some calculations and saw that each task needs no more than 100 MB of memory, so with 8 tasks I am using about 800 MB of memory. I don't know what is going on. I even updated my hadoop-env.sh file with these lines:

   export HADOOP_HEAPSIZE=8000
   export HADOOP_OPTS=-XX:+UseParallelGC

What is the problem? Do these lines even override the Java options for my system? Using the parallel GC is something I saw recommended on the internet for machines with multiple cores.
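One detail worth checking (my assumption, not something stated in the post): in classic Hadoop, HADOOP_HEAPSIZE and HADOOP_OPTS in hadoop-env.sh mainly size the daemon JVMs (jobtracker, tasktracker, namenode and so on), while each map task runs in its own child JVM whose options come from the mapred.child.java.opts property, which historically defaulted to -Xmx200m. Below is a minimal sketch of setting that property from a job driver, using the old mapred API to match the jobtracker/tasktracker setup described here; the class name and the values are hypothetical.

    import org.apache.hadoop.mapred.JobConf;

    public class TreeLevelJobDriver {
        public static void main(String[] args) {
            // Hypothetical driver for one tree-level job.
            JobConf conf = new JobConf(TreeLevelJobDriver.class);
            // Give each spawned task JVM up to 1 GB and the parallel collector,
            // instead of relying only on the hadoop-env.sh settings above.
            conf.set("mapred.child.java.opts", "-Xmx1024m -XX:+UseParallelGC");
            // ... set mapper, reducer, input/output paths as in the existing job ...
            // JobClient.runJob(conf);
        }
    }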

     Edit

OK, here are some edits after monitoring heap usage and total memory. I consume about 3500 MB of RAM when running 6 tasks at the same time. That means the jobtracker, tasktracker, namenode, datanode, secondary namenode, my operating system and the 6 tasks together use about 3500 MB of RAM, which is a very reasonable amount. So why do I get a GC overhead limit error? I follow the same algorithm for every tree level; the only thing that changes is the number of nodes in each level. Having many nodes in a tree level does not add that much overhead to my algorithm. So why can't the GC work well?
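Monitoring total RAM only shows what the operating system sees; the GC overhead limit is about how full each task's own heap is. The following is a small self-contained sketch (not from the original post) of the kind of logging that could be called once per tree level, or every few thousand records inside map(), to compare used heap against the JVM's effective maximum; the class and label names are made up, only the Runtime calls are standard Java.

    public class HeapLogger {
        /** Print current heap usage against the JVM's effective -Xmx. */
        public static void logHeap(String label) {
            Runtime rt = Runtime.getRuntime();
            long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
            long maxMb  = rt.maxMemory() / (1024 * 1024);
            System.err.println(label + ": used=" + usedMb + " MB of max " + maxMb + " MB");
        }

        public static void main(String[] args) {
            logHeap("tree level 0"); // e.g. call once per tree level
        }
    }

If the reported maximum comes out around 200 MB, that would suggest the -Xmx from hadoop-env.sh is not reaching the task JVMs at all.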


Solution

  • If your maximum heap size hasn't changed, it will be 1/4 of main memory, i.e. about 3 GB, plus some overhead for non-heap usage, which could bring the process to about 3.5 GB.

    I suggest you try

    export HADOOP_OPTS="-XX:+UseParallelGC -Xmx8g"
    

    to set the maximum memory to 8 GB.


    By default the maximum heap size is 1/4 of main memory (unless you are running a 32-bit JVM on Windows). So if your maximum heap setting is being ignored, it will still be 3 GB.

    Whether you use one GC or another, it won't make much difference to when you run out of memory.

    I suggest you take a heap dump with -XX:+HeapDumpOnOutOfMemoryError and load it in a profiler, e.g. VisualVM, to see why it's using so much memory.
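    In a Hadoop job the OutOfMemoryError is thrown inside the child task JVMs on the tasktracker nodes, so the dump flags have to reach those JVMs rather than the client. A hedged sketch of doing that through the classic mapred.child.java.opts property (the class name and dump path are only examples); each .hprof file is written on the node where the task ran and can then be opened in VisualVM.

        import org.apache.hadoop.mapred.JobConf;

        public class HeapDumpOptsExample {
            public static void main(String[] args) {
                JobConf conf = new JobConf(HeapDumpOptsExample.class);
                // Forward the heap-dump flags to every spawned task JVM.
                conf.set("mapred.child.java.opts",
                         "-Xmx1024m -XX:+HeapDumpOnOutOfMemoryError"
                       + " -XX:HeapDumpPath=/tmp/task-heap-dumps");
                // ... configure mapper/reducer and submit the job as usual ...
            }
        }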