I want to process multiline CSV files and for that I wrote a custom CSVInputFormat.
I would like to have about 40 threads processing CSV lines on each hadoop node. However, when I create a cluster on Amazon EMR with 5 machines (1 master and 4 cores), I can see I get only 2 map tasks running, even if there are 6 available map slots:
I implemented getSplits in my inputFormat so it would behave like NLineInputFormat. I was expecting with this I would get more thing running in parallel, but have had no effect. Also, I tried setting arguments -s,mapred.tasktracker.map.tasks.maximum=10 --args -jobconf,mapred.map.tasks=10
, but no effect.
What can I do to have lines being processed in parallel? The way hadoop is running, it's not scalable, as doesn't matter how many instances I allocate to the cluster, only two map tasks will run at most.
UPDATE: When I use a non compressed file (zip) as origin, it create more map tasks, about 17 for 1.3 million rows. Even so, I wonder why it shouldn't be more and why more mappers aren't created when data is zipped.
Change the split size to have more splits.
Configuration conf= new Cofiguration();
//set the value that increases your number of splits.
conf.set("mapred.max.split.size", "1020");
Job job = new Job(conf, "My job name");