I have been using a Hadoop 1.2.1 server and have run many Pig jobs on it. Recently I considered moving to Hadoop 2.2.0, so I tried some of the same Pig jobs on Hadoop 2.2.0 that I had run on Hadoop 1.2.1.
But one thing I can hardly understand about YARN MR2 is that only ONE reduce task is scheduled in every MR job.
At first I thought this was fine: I assumed the reduce phase would be faster than in MR1 because the ResourceManager was scheduling it efficiently on a single server.
But even for every large MR job, YARN MR2 schedules only ONE reduce task, every time.
Below is an extreme case.
Kind     Total Tasks (successful+failed+killed)  Successful tasks  Failed tasks  Killed tasks  Start Time            Finish Time
Setup    1                                       1                 0             0             27-Jan-2014 18:01:45  27-Jan-2014 18:01:46 (0sec)
Map      2425                                    2423              0             2             27-Jan-2014 18:01:26  27-Jan-2014 19:08:58 (1hrs, 7mins, 31sec)
Reduce   166                                     163               0             3             27-Jan-2014 18:04:35  27-Jan-2014 20:40:15 (2hrs, 35mins, 40sec)
Cleanup  1                                       1                 0             0             27-Jan-2014 20:40:16  27-Jan-2014 20:40:17 (1sec)
This run, with 166 reduce tasks on the old Hadoop 1.2.1 server, took 2 hours and 38 minutes.
Job Name: PigLatin:DefaultJobName
User Name: hduser
Queue: default
State: SUCCEEDED
Uberized: false
Started: Tue Jan 28 16:09:41 KST 2014
Finished: Tue Jan 28 21:47:45 KST 2014
Elapsed: 5hrs, 38mins, 4sec
Diagnostics:
Average Map Time: 41sec
Average Reduce Time: 3hrs, 48mins, 23sec
Average Shuffle Time: 1hrs, 36mins, 35sec
Average Merge Time: 1hrs, 27mins, 38sec

ApplicationMaster
Attempt Number  Start Time                    Node              Logs
1               Tue Jan 28 16:09:39 KST 2014  awdatanode2:8042  logs

Task Type  Total  Complete
Map        1172   1172
Reduce     1      1

Attempt Type  Failed  Killed  Successful
Maps          0       1       1172
Reduces       0       0       1
This run, with a single reduce task on the new Hadoop 2.2.0 server, took 5 hours and 38 minutes.
Although my old Hadoop server has poor resources, it is much faster than the new one, because the reduce tasks are distributed. The Hadoop 2.2.0 server, on the other hand, has rich resources, and its map phase was much faster than on the old system, but the reduce phase takes a terribly long time.
On Hadoop 2.2 the memory is configured as Map (4G, heap space 3G) and Reduce (8G, heap space 6G). I tried various other configuration settings, but the result was always one reduce task.
So I examined the Pig source code.
The reason my Pig jobs always get one reduce task is that the InputSizeReducerEstimator class cannot access the HDFS file system.
    // line 79 of InputSizeReducerEstimator.java
    List<POLoad> poLoads = PlanHelper.getPhysicalOperators(mapReduceOper.mapPlan, POLoad.class);
The resulting poLoads list always has size 0.
So the number of reducers for my job is always estimated as one.
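To make the effect concrete, here is a minimal sketch of an input-size-based estimate. It is not Pig's actual code: the class and method names are made up for illustration, and the defaults of 1 GB per reducer and at most 999 reducers are my understanding of pig.exec.reducers.bytes.per.reducer and pig.exec.reducers.max, so treat them as assumptions. The point is only that an empty list of inputs gives a total size of 0, and the estimate is then clamped up to 1.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; this is NOT the real InputSizeReducerEstimator.
public class ReducerEstimateSketch {
    // Assumed defaults (believed to match pig.exec.reducers.bytes.per.reducer
    // and pig.exec.reducers.max); adjust if your Pig version differs.
    static final long BYTES_PER_REDUCER = 1_000_000_000L;
    static final int MAX_REDUCERS = 999;

    // Estimate reducers as ceil(totalInputBytes / bytesPerReducer),
    // clamped to the range [1, maxReducers].
    static int estimateReducers(List<Long> inputSizesInBytes,
                                long bytesPerReducer, int maxReducers) {
        long total = 0L;
        for (long size : inputSizesInBytes) {
            total += size;
        }
        int reducers = (int) Math.ceil((double) total / bytesPerReducer);
        reducers = Math.max(1, reducers);        // never fewer than one reducer
        return Math.min(maxReducers, reducers);  // never more than the cap
    }

    public static void main(String[] args) {
        // No load operators found: total size is 0, so the estimate is 1.
        List<Long> noInputs = Collections.<Long>emptyList();
        System.out.println(estimateReducers(noInputs, BYTES_PER_REDUCER, MAX_REDUCERS));

        // Two made-up inputs totalling 165 GB: the estimate scales with size.
        List<Long> someInputs = Arrays.asList(100_000_000_000L, 65_000_000_000L);
        System.out.println(estimateReducers(someInputs, BYTES_PER_REDUCER, MAX_REDUCERS));
    }
}
```

With the inputs visible, the estimate grows with the data size; with none visible, it collapses to a single reducer regardless of how large the job really is.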
I solved this problem by rebuilding pig-0.12.1-h2.jar.
I asked the Pig user group... and they patched it at