Tags: java, hadoop, mapreduce, elastic-map-reduce, amazon-emr

Reading large files using MapReduce in Hadoop


I have code that reads files from an FTP server and writes them into HDFS. I have implemented a customised InputFormat/RecordReader that sets the isSplitable property of the input to false. However, this gives me the following error.

INFO mapred.MapTask: Record too large for in-memory buffer

The code I use to read data is

    Path file = fileSplit.getPath();
    FileSystem fs = file.getFileSystem(conf);
    FSDataInputStream in = null;
    try {
        in = fs.open(file);
        // contents is a byte[] sized to the whole file, so the entire file
        // is pulled into memory and emitted as a single record
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
    } finally {
        IOUtils.closeStream(in);
    }

Any ideas how to avoid the Java heap space error without splitting the input file? Or, if I do make isSplitable true, how should I go about reading the file?


Solution

  • If I understood you correctly - you load the whole file into memory. This is unrelated to Hadoop - you cannot do that in Java and be sure you have enough memory. I would suggest defining some reasonable chunk size and making each chunk "a record" - see the sketch below.
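To make that concrete, here is a minimal sketch of such a chunk-based RecordReader, written against the new org.apache.hadoop.mapreduce API. The class name ChunkRecordReader, the LongWritable/BytesWritable key-value types and the 16 MB CHUNK_SIZE are illustrative assumptions, not code from the question; adjust them to your own format.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    /**
     * Reads an unsplittable file as a sequence of fixed-size chunks,
     * so no single record has to hold the whole file in memory.
     */
    public class ChunkRecordReader extends RecordReader<LongWritable, BytesWritable> {

        private static final int CHUNK_SIZE = 16 * 1024 * 1024; // 16 MB per record (example value)

        private FSDataInputStream in;
        private long fileLength;
        private long bytesRead = 0;
        private final LongWritable key = new LongWritable();
        private final BytesWritable value = new BytesWritable();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            FileSplit fileSplit = (FileSplit) split;
            Configuration conf = context.getConfiguration();
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(conf);
            fileLength = fs.getFileStatus(file).getLen();
            in = fs.open(file);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            if (bytesRead >= fileLength) {
                return false;                               // whole file consumed
            }
            int toRead = (int) Math.min(CHUNK_SIZE, fileLength - bytesRead);
            byte[] buffer = new byte[toRead];
            IOUtils.readFully(in, buffer, 0, toRead);       // read exactly one chunk
            key.set(bytesRead);                             // key = byte offset of this chunk
            value.set(buffer, 0, toRead);
            bytesRead += toRead;
            return true;
        }

        @Override
        public LongWritable getCurrentKey() { return key; }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() {
            return fileLength == 0 ? 1.0f : (float) bytesRead / fileLength;
        }

        @Override
        public void close() throws IOException {
            IOUtils.closeStream(in);
        }
    }

With isSplitable still false, the whole file is still processed by a single mapper (your FileInputFormat would return this reader from createRecordReader), but each call to map() now sees only one chunk, so only about CHUNK_SIZE bytes need to fit in the map-side buffer at a time instead of the entire file.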