
dfs.block.size for local Hadoop jobs?


I want to run a Hadoop unit test using the local filesystem mode. I would ideally like to see several part-m-* files written out to disk (rather than just one). However, since it is just a test, I don't want to process 64 MB of data (the default block size is ~64 MB, I believe).

In distributed mode we can set this using

dfs.block.size
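
For reference, in distributed mode I would set it roughly like this (just a sketch; as far as I know the value is interpreted in bytes):

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // block size is given in bytes; here ~1 MB instead of the ~64 MB default
    conf.setLong("dfs.block.size", 1024 * 1024);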

I am wondering whether there is a way to get my local filesystem to write small part-m files out, i.e. so that my unit test will mimic the contents of large-scale data with several (albeit very small) files.


Solution

  • Assuming your input format can handle splittable files (see the org.apache.hadoop.mapreduce.lib.input.FileInputFormat.isSplitable(JobContext, Path) method), you can amend the input split size to process a smaller file with multiple mappers (I'm going to assume you're using the new API mapreduce package):

    For example, if you're using the TextInputFormat (or most input formats that extend FileInputFormat), you can call the static util methods:

    • FileInputFormat.setMaxInputSplitSize(Job, long)
    • FileInputFormat.setMinInputSplitSize(Job, long)

    The long argument is the size of the split in bytes, so just set it to your desired size (see the sketch after the property list below).

    Under the hood, these methods set the following job configuration properties:

    • mapred.min.split.size
    • mapred.max.split.size
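
    Putting that together, here is a minimal sketch of how you might configure a job for your local test (the job name, input path, and 1 KB split ceiling are placeholder values; depending on your Hadoop version you may need new Job(conf) instead of Job.getInstance):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "small-split-test");              // placeholder job name
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("target/test-input")); // hypothetical path

        // Cap each split at 1 KB so even a tiny local file produces several map tasks
        // (and therefore several part-m-* output files)
        FileInputFormat.setMinInputSplitSize(job, 1L);
        FileInputFormat.setMaxInputSplitSize(job, 1024L);

    The number of part-m-* files in the output directory should then match the number of map tasks, i.e. roughly the input size divided by the max split size.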

    One final note: some input formats may override the FileInputFormat.getFormatMinSplitSize() method (which defaults to 1 byte for FileInputFormat), so be wary if you set a value and Hadoop appears to ignore it.

    A final point: have you considered MRUnit (http://incubator.apache.org/mrunit/) for actual 'unit' testing of your MR code?
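
    If you go the MRUnit route, a test looks roughly like this (WordCountMapper is a hypothetical mapper that emits (word, 1) per token; the MapDriver shown is from MRUnit's mapreduce package):

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mrunit.mapreduce.MapDriver;

        // Drives a single mapper entirely in memory - no filesystem or part-m-* files involved
        MapDriver<LongWritable, Text, Text, IntWritable> driver =
            MapDriver.newMapDriver(new WordCountMapper());
        driver.withInput(new LongWritable(0), new Text("foo foo bar"))
              .withOutput(new Text("foo"), new IntWritable(1))
              .withOutput(new Text("foo"), new IntWritable(1))
              .withOutput(new Text("bar"), new IntWritable(1))
              .runTest();

    This tests the mapper logic directly, which may be simpler than reproducing multiple output files if that is all you are really after.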