I am trying to run a Hadoop Streaming application on a Hadoop 2 cluster. I am using the following configuration to launch the app:
hadoop jar /usr/lib/hadoop2/share/hadoop/tools/lib/hadoop-streaming.jar \
-D mapred.job.name=step01_load_delta_customer_events \
-D mapreduce.input.fileinputformat.split.minsize=134217728 \
-D mapreduce.job.reduces=10 \
-D mapreduce.map.memory.mb=4704 \
-D mapreduce.map.java.opts=-Xmx4416m \
-D stream.map.input.ignoreKey=true \
-D mapreduce.map.output.compress=true \
-D mapreduce.output.fileoutputformat.compress=true \
-D mapreduce.output.fileoutputformat.compress.type=BLOCK \
-D mapred.max.map.failures.percent=7 \
-D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
-D mapreduce.map.output.compress.codec=com.hadoop.compression.lzo.LzoCodec \
-D mapred.output.committer.class=org.apache.hadoop.mapred.DirectFileOutputCommitter \
-D mapreduce.use.directfileoutputcommitter=true \
-files <file path> \
-mapper <mapper code in python> \
-reducer <reduce code in python> \
-input "$INPUT" \
-outputformat org.apache.hadoop.mapred.TextOutputFormat \
-output "$OUTPUT"
My input files are kept in AWS S3 and I have 5400 S3 objects in my input path. Object sizes vary from 1 MB to 100 MB, and the total input size is ~25 GB. Based on my input split size configuration (128 MB) I was expecting about 200 mapper tasks (~25 GB / 128 MB). But when the app runs there are 5400 mapper tasks, which is exactly equal to the number of S3 objects in my input. I think this is hurting the performance of my application. Can someone help me understand this behaviour? Also, how can I control the number of mappers in this case? My app is running on a Qubole Hadoop 2 cluster.
The problem was with the input format. The default TextInputFormat creates at least one split per input file, so the split.minsize setting can only make splits within a single file larger; it can never combine several small files into one split, which is why I got exactly one mapper per S3 object. I used CombineTextInputFormat instead of TextInputFormat and the input splits now work just fine.
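For anyone who hits the same issue, here is a minimal sketch of the changed invocation. It assumes the old-API class org.apache.hadoop.mapred.lib.CombineTextInputFormat (streaming's -inputformat option expects a class from the org.apache.hadoop.mapred API) and uses mapreduce.input.fileinputformat.split.maxsize to cap the size of each combined split; verify both against your Hadoop version:

hadoop jar /usr/lib/hadoop2/share/hadoop/tools/lib/hadoop-streaming.jar \
-D mapreduce.input.fileinputformat.split.maxsize=134217728 \
-inputformat org.apache.hadoop.mapred.lib.CombineTextInputFormat \
-files <file path> \
-mapper <mapper code in python> \
-reducer <reduce code in python> \
-input "$INPUT" \
-output "$OUTPUT"

With a 128 MB (134217728 byte) maximum split size, ~25 GB of input should yield roughly 200 mappers instead of one per S3 object.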