Tags: java, hadoop, mapreduce, distributed-computing

Change File Split size in Hadoop


I have a bunch of small files in an HDFS directory. Although the files themselves are small, the processing time per file is huge. That is, even a 64 MB file, which is the default split size for TextInputFormat, can take several hours to process.

What I need to do is reduce the split size so that I can utilize more nodes for the job.

So the question is: how can I split the files into, say, 10 KB splits? Do I need to implement my own InputFormat and RecordReader for this, or is there a parameter to set? Thanks.


Solution

  • The parameter mapred.max.split.size, which can be set individually per job, is what you're looking for. Don't change dfs.block.size, because that is global for HDFS and can lead to problems. A minimal driver sketch follows below.
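
A minimal driver sketch setting this per job, assuming the newer org.apache.hadoop.mapreduce API (class and job names here are illustrative, and the mapper/reducer are omitted since only the split-size setting matters):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallSplitDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Cap each input split at 10 KB so a single 64 MB file is broken
        // into many splits, each handled by its own map task.
        // (On Hadoop 2.x+ the equivalent key is
        //  mapreduce.input.fileinputformat.split.maxsize.)
        conf.setLong("mapred.max.split.size", 10 * 1024);

        Job job = Job.getInstance(conf, "small-split-job");
        job.setJarByClass(SmallSplitDriver.class);
        job.setInputFormatClass(TextInputFormat.class);

        // The new mapreduce API also offers an equivalent helper:
        // FileInputFormat.setMaxInputSplitSize(job, 10 * 1024L);

        // Mapper, reducer, and output key/value types omitted -- plug in your own.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that the number of splits is roughly fileSize / maxSplitSize, so capping a 64 MB file at 10 KB produces several thousand map tasks; make sure the per-record work is heavy enough to justify that much task overhead.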