hadoop, hadoop-partitioning

hadoop - how total mappers are determined


I am new to Hadoop and just installed Oracle's VirtualBox and Hortonworks' Sandbox. I then downloaded the latest version of Hadoop and imported the jar files into my Java program. I copied a sample WordCount program, created a new jar file, and ran it as a job using the Sandbox. The WordCount works perfectly fine as expected. However, on my job status page, I see that the number of mappers for my input file is 28. My input file contains only the following line.

Ramesh is studying at XXXXXXXXXX XX XXXXX XX XXXXXXXXX.

How was the total number of mappers determined to be 28?

I added the line below to my wordcount.java program to check.

FileInputFormat.setMaxInputSplitSize(job, 2);
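
For context, here is roughly where that call sits in the driver (a minimal sketch, using the org.apache.hadoop.mapreduce API and the stock TokenCounterMapper/IntSumReducer library classes rather than the exact mapper and reducer from my program):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenCounterMapper.class);   // library mapper: emits (word, 1)
        job.setReducerClass(IntSumReducer.class);       // library reducer: sums the 1s
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Cap every input split at 2 bytes; this forces many tiny splits,
        // and each split becomes its own map task.
        FileInputFormat.setMaxInputSplitSize(job, 2);

        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}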

Also, I would like to know whether an input file should contain only 2 rows. For example, suppose I have an input file like the one below.

row1,row2,row3,row4,row5,row6.......row20

Should I split the input file into 20 different files each having only 2 rows?


Solution

  • That means your input file was split into roughly 28 parts (blocks) in HDFS, since you said 28 map tasks were scheduled. However, that does not necessarily mean 28 map tasks ran in parallel; parallelism depends on the number of slots available in your cluster. I am speaking in terms of Apache Hadoop; I don't know whether Hortonworks has modified this behavior. The split-size sketch after this answer walks through the arithmetic.

    Hadoop works best with large files, so do you really want to split your input file into 20 different files?
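
To make the split arithmetic concrete, here is a simplified sketch of how FileInputFormat sizes splits. The real implementation (FileInputFormat.computeSplitSize) also applies a small slop factor when cutting the last split, and the 128 MB block size and ~56-byte file length below are assumptions based on the question:

public class SplitSizeDemo {
    // Simplified version of FileInputFormat.computeSplitSize(blockSize, minSize, maxSize)
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize  = 128L * 1024 * 1024;  // assumed HDFS block size
        long fileLength = 56;                  // roughly the one-line input file, in bytes

        // Defaults: minSize = 1, maxSize = Long.MAX_VALUE -> split size = block size -> 1 mapper
        long defaultSplit = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        System.out.println("default mappers: " + (long) Math.ceil((double) fileLength / defaultSplit));

        // With FileInputFormat.setMaxInputSplitSize(job, 2): split size = 2 bytes -> ~28 mappers
        long cappedSplit = computeSplitSize(blockSize, 1L, 2L);
        System.out.println("capped mappers: " + (long) Math.ceil((double) fileLength / cappedSplit));
    }
}

So if the setMaxInputSplitSize(job, 2) call was already in effect when the job ran, the ~56-byte line alone would yield about 28 two-byte splits, which would line up with the 28 map tasks reported.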