I want to run a Hadoop unit test using the local filesystem mode... I would ideally like to see several part-m-* files written out to disk (rather than just one). However, since it is just a test, I don't want to process 64 MB of data (the default block size is ~64 MB, I believe).
In distributed mode we can set this using
dfs.block.size
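e.g. something like this (as far as I understand, this is the 1.x property name; newer releases call it dfs.blocksize):

    Configuration conf = new Configuration();
    // shrink the HDFS block size for the job (distributed mode only)
    conf.setLong("dfs.block.size", 1024L * 1024L); // 1 MB blocks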
I am wondering whether there is a way to get my local filesystem to write out small part-m files, i.e. so that my unit test will mimic the contents of large-scale data with several (albeit very small) files.
Assuming your input format can handle splittable files (see the org.apache.hadoop.mapreduce.lib.input.FileInputFormat.isSplitable(JobContext, Path) method), you can amend the input split size to process a small file with multiple mappers (I'm going to assume you're using the new API mapreduce package).
For example, if you're using TextInputFormat (or most input formats that extend FileInputFormat), you can call the static utility methods:
FileInputFormat.setMaxInputSplitSize(Job, long)
FileInputFormat.setMinInputSplitSize(Job, long)
The long argument is the split size in bytes, so just set it to your desired size.
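For instance, a minimal driver sketch (the class name, split sizes and input path here are just placeholders, not anything from your code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class SmallSplitDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "small-split-test"); // Job.getInstance(conf, ...) on newer releases

            job.setInputFormatClass(TextInputFormat.class);

            // cap splits at 1 KB so even a tiny test file yields several mappers
            FileInputFormat.setMinInputSplitSize(job, 1L);
            FileInputFormat.setMaxInputSplitSize(job, 1024L);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            // ... set mapper / reducer / output classes and output path as usual
        }
    }

Each map task then writes its own part-m-* output file (assuming a map-only job), so even in local mode you should see several small files.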
Under the hood, these methods set the following job configuration properties:
mapred.min.split.size
mapred.max.split.size
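If you'd rather set them directly in your test's configuration, something along these lines should be equivalent (note that on Hadoop 2.x+ these properties were renamed to mapreduce.input.fileinputformat.split.minsize / maxsize):

    Configuration conf = new Configuration();
    // same effect as the static setters above
    conf.setLong("mapred.min.split.size", 1L);
    conf.setLong("mapred.max.split.size", 1024L);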
Final note: some input formats may override the FileInputFormat.getFormatMinSplitSize() method (which defaults to 1 byte for FileInputFormat), so be wary if you set a value and Hadoop appears to ignore it.
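If you do hit that, one workaround (just a sketch, assuming the format in question doesn't mark the method final; TextInputFormat is used here purely for illustration) is to subclass the input format in your test and relax the floor:

    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // test-only input format that puts the minimum split size back to 1 byte
    public class TinySplitTextInputFormat extends TextInputFormat {
        @Override
        protected long getFormatMinSplitSize() {
            return 1L;
        }
    }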
One last point - have you considered MRUnit (http://incubator.apache.org/mrunit/) for actual 'unit' testing of your MR code?
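If it helps, a minimal MRUnit sketch for a mapper test might look something like this (MyMapper and the expected key/value pair are placeholders for your own code):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mrunit.mapreduce.MapDriver;
    import org.junit.Test;

    public class MyMapperTest {
        @Test
        public void outputsOneCountPerWord() throws IOException {
            // drive a single mapper in memory - no local job runner or filesystem needed
            MapDriver<LongWritable, Text, Text, IntWritable> driver =
                MapDriver.newMapDriver(new MyMapper());
            driver.withInput(new LongWritable(0), new Text("hello"))
                  .withOutput(new Text("hello"), new IntWritable(1))
                  .runTest();
        }
    }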