I have code that reads files from an FTP server and writes them into HDFS. I have implemented a customised InputFormat reader that sets the isSplitable property of the input to false. However, this gives me the following error:
INFO mapred.MapTask: Record too large for in-memory buffer
The code I use to read the data is:
// contents is a byte[] sized to hold the entire file
Path file = fileSplit.getPath();
FileSystem fs = file.getFileSystem(conf);
FSDataInputStream in = null;
try {
    in = fs.open(file);
    IOUtils.readFully(in, contents, 0, contents.length);
    value.set(contents, 0, contents.length);
} finally {
    IOUtils.closeStream(in);
}
Any ideas how to avoid the Java heap space error without splitting the input file? Or, if I set isSplitable to true, how do I go about reading the file?
If I understood you correctly, you load the whole file into memory. Independently of Hadoop, you cannot do that in Java and be sure you have enough memory.
I would suggest defining some reasonably sized chunk and making that chunk "a record", as in the sketch below.
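For illustration, here is a minimal sketch of such a chunk-based reader, assuming the new org.apache.hadoop.mapreduce API and an unsplittable FileSplit that covers the whole file. ChunkRecordReader and CHUNK_SIZE are hypothetical names, and the LongWritable/BytesWritable key-value types are just one possible choice:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical chunk-based reader: each record is one fixed-size slice of the file.
public class ChunkRecordReader extends RecordReader<LongWritable, BytesWritable> {

    private static final int CHUNK_SIZE = 16 * 1024 * 1024; // 16 MB per record (tune as needed)

    private FSDataInputStream in;
    private long fileLength;
    private long pos;
    private final LongWritable key = new LongWritable();
    private final BytesWritable value = new BytesWritable();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
        FileSplit fileSplit = (FileSplit) split;
        Path file = fileSplit.getPath();
        FileSystem fs = file.getFileSystem(context.getConfiguration());
        fileLength = fs.getFileStatus(file).getLen();
        in = fs.open(file);
        pos = 0;
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (pos >= fileLength) {
            return false;                     // whole file consumed
        }
        int toRead = (int) Math.min(CHUNK_SIZE, fileLength - pos);
        byte[] buffer = new byte[toRead];
        in.readFully(pos, buffer, 0, toRead); // positioned read of one chunk
        key.set(pos);                         // key = byte offset of the chunk
        value.set(buffer, 0, toRead);         // value = the chunk itself
        pos += toRead;
        return true;
    }

    @Override
    public LongWritable getCurrentKey() { return key; }

    @Override
    public BytesWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() {
        return fileLength == 0 ? 1.0f : (float) pos / fileLength;
    }

    @Override
    public void close() throws IOException {
        if (in != null) {
            in.close();
        }
    }
}

The matching InputFormat would keep isSplitable() returning false and return this reader from createRecordReader(), so a single mapper still reads the whole file sequentially but never holds more than one chunk in memory at a time.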