csv hadoop amazon-emr hadoop-partitioning

How to increase hadoop map tasks by implementing getSplits

I want to process multiline CSV files and for that I wrote a custom CSVInputFormat.

I would like to have about 40 threads processing CSV lines on each hadoop node. However, when I create a cluster on Amazon EMR with 5 machines (1 master and 4 cores), I can see I get only 2 map tasks running, even if there are 6 available map slots:

dashboard on EMR showing number of map tasks and available slots

I implemented getSplits in my inputFormat so it would behave like NLineInputFormat. I was expecting with this I would get more thing running in parallel, but have had no effect. Also, I tried setting arguments -s,mapred.tasktracker.map.tasks.maximum=10 --args -jobconf,mapred.map.tasks=10, but no effect.

What can I do to have lines being processed in parallel? The way hadoop is running, it's not scalable, as doesn't matter how many instances I allocate to the cluster, only two map tasks will run at most.

UPDATE: When I use a non compressed file (zip) as origin, it create more map tasks, about 17 for 1.3 million rows. Even so, I wonder why it shouldn't be more and why more mappers aren't created when data is zipped.

Solution

Change the split size to have more splits.

Configuration conf= new Cofiguration();
//set the value that increases your number of splits.
conf.set("mapred.max.split.size", "1020");
Job job = new Job(conf, "My job name");