I am new to Hadoop and just installed Oracle's VirtualBox and Hortonworks' Sandbox. I then downloaded the latest version of Hadoop and imported the jar files into my Java program. I copied a sample WordCount program and created a new jar file. I ran this jar file as a job using the Sandbox. The WordCount works perfectly fine as expected. However, on my job status page, I see that the number of mappers for my input file is determined as 28. My input file contains the following line.
Ramesh is studying at XXXXXXXXXX XX XXXXX XX XXXXXXXXX.
How is the total number of mappers determined as 28?
I added the line below to my wordcount.java program to check.
FileInputFormat.setMaxInputSplitSize(job, 2);
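For reference, here is roughly how the driver is set up. This is essentially the standard WordCount example with that one line added; the input/output paths come from the command-line arguments, so treat it as a sketch rather than my exact code.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Emits (word, 1) for every token in each input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Sums the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // The line I added: cap each input split at 2 bytes.
    FileInputFormat.setMaxInputSplitSize(job, 2);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}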
Also, I would like to know whether the input file can contain only 2 rows. For example, suppose I have an input file like the one below.
row1,row2,row3,row4,row5,row6.......row20
Should I split the input file into 20 different files, each having only 2 rows?
Since you said 28 map tasks were scheduled, that means your input is split into roughly 28 parts (splits) in HDFS. That does not necessarily mean 28 map tasks run in parallel, though; parallelism depends on the number of slots available in your cluster. I'm speaking in terms of Apache Hadoop here; I don't know whether Hortonworks has modified this behavior.
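As for how that split count comes about: FileInputFormat picks a split size of max(minSplitSize, min(maxSplitSize, blockSize)) and then launches roughly one map task per split. Here is a small standalone sketch of that arithmetic; the block size and file size are assumed, illustrative values, but note that with a 2-byte max split a ~56-byte one-line file comes out to about 28 splits, which would line up with your 28 mappers if that setMaxInputSplitSize(job, 2) cap was in effect when the job ran.

// Standalone sketch of how FileInputFormat-style split math yields the mapper count.
// The sizes below are illustrative assumptions, not values read from your cluster.
public class SplitMath {

  // Mirrors the formula used by FileInputFormat: max(minSize, min(maxSize, blockSize))
  static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  public static void main(String[] args) {
    long blockSize = 128L * 1024 * 1024; // assume a 128 MB HDFS block size
    long minSize   = 1;                  // assume the default minimum split size
    long maxSize   = 2;                  // what setMaxInputSplitSize(job, 2) requests: 2 bytes

    long splitSize = computeSplitSize(blockSize, minSize, maxSize); // -> 2 bytes

    long fileSize  = 56;                 // e.g. a ~56-byte single-line input file
    long numSplits = (long) Math.ceil((double) fileSize / splitSize); // -> 28 splits

    System.out.println("split size = " + splitSize + " bytes");
    System.out.println("num splits = " + numSplits + " (roughly one map task per split)");
  }
}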
Hadoop is designed to work with large files, so do you really want to split your input file into 20 different files?