I'm trying to figure out what could be causing my EMR job to run out of memory before it has even started processing my file inputs. I'm getting a "java.lang.OutOfMemoryError cannot be cast to java.lang.Exception" error before my RecordReader is even initialized (i.e., before it has even tried to unzip the files and process them). I'm running the job on a directory with a very large number of input files; the same job runs fine on a smaller input set. Does anyone have any ideas?
I realized that the answer is that there was too much metadata overhead on the master node. The master node has to keep roughly 150 bytes of metadata in memory for every object (each file and each of its blocks) that will be processed. With millions of files this adds up to gigabytes of metadata, which was too much for the master node and caused it to crash.
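For a rough sense of scale, here's a minimal back-of-the-envelope sketch. The file count and blocks-per-file below are hypothetical, and the ~150-bytes-per-object figure is just the commonly cited rule of thumb, not a measurement from my job:

```java
// Rough estimate of master-node metadata memory for a job over many small files.
public class MetadataMemoryEstimate {
    public static void main(String[] args) {
        long files = 10_000_000L;      // hypothetical: 10 million small input files
        long blocksPerFile = 1;        // small files typically occupy a single block
        long bytesPerObject = 150;     // rule-of-thumb metadata cost per file/block object

        long objects = files + files * blocksPerFile;   // one object per file + one per block
        long bytes = objects * bytesPerObject;

        System.out.printf("~%d metadata objects -> ~%.1f GB of master-node heap%n",
                objects, bytes / 1e9);                  // prints ~3.0 GB for this example
    }
}
```

At that scale the heap fills up with bookkeeping for the input files alone, which is why the job died before any records were actually read.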
Here's a good source for more information: http://www.inquidia.com/news-and-info/working-small-files-hadoop-part-1