java hadoop mapreduce distributed-computing elastic-map-reduce

Reducer node takes a long time to receive its records


In the Hadoop web UI, I found that some of the reduce tasks reach 66.66% and stay there for a long time. During that time, their counters show zero input records.

After a long time, they get their input records and start processing them. Some show 0 input records for even longer and are killed with "Task attempt failed to report status for 600 seconds".

But some of the reducers show input records in their counters immediately and start processing them right away.

I do not know why some reducers take so long to get their input records. This happens only with this program, not with the others.

In this MapReduce job, the reducer's configure() method, which comes before its reduce() method, reads a lot of data from the distributed cache. Is this the reason? I am not sure.


Solution

  • Yes, I believe reading from the distributed cache is the reason for the delay. It makes no difference whether configure() appears before or after reduce() in your class, because configure() always has to be called first. The run() method of the reducer looks like this (new API):

    public void run(Context context) throws IOException, InterruptedException {
    
        setup(context); // This is the counterpart of configure() from older API
    
        while (context.nextKey()) {
            reduce(context.getCurrentKey(), context.getValues(), context);
        }
        cleanup(context);
    }
    

    As you can see, setup() is called before reduce(). Likewise, in the older API the actual reduce work does not start until configure() finishes, which explains why the counters show no input records.
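
    The same ordering can be reproduced without Hadoop. The following is only a minimal sketch, not Hadoop code: MiniReducer, SlowSetupDemo, and the inputRecords field are hypothetical names that mimic the shape of run() to show that the record counter cannot move until setup() returns.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Iterator;
    import java.util.List;

    // Hypothetical demo reducer; its setup() stands in for a slow distributed-cache read.
    class SlowSetupDemo extends MiniReducer {
        // Value of the record counter as observed from inside setup().
        long recordsSeenDuringSetup = -1;
        final List<String> output = new ArrayList<>();

        @Override
        protected void setup() {
            // A real job would load its cache files here; no record has been read yet.
            recordsSeenDuringSetup = inputRecords;
        }

        @Override
        protected void reduce(String record) {
            output.add(record.toUpperCase());
        }

        public static void main(String[] args) {
            SlowSetupDemo demo = new SlowSetupDemo();
            demo.run(Arrays.asList("a", "b", "c").iterator());
            System.out.println("records seen during setup: " + demo.recordsSeenDuringSetup);
            System.out.println("records seen after run:    " + demo.inputRecords);
        }
    }

    // Hypothetical stand-in that only mimics the setup -> reduce -> cleanup order.
    abstract class MiniReducer {
        // Plays the role of the "input records" counter shown in the web UI.
        protected long inputRecords = 0;

        protected void setup() {}
        protected abstract void reduce(String record);
        protected void cleanup() {}

        // Same shape as run() above: setup() must return before any record is consumed.
        public final void run(Iterator<String> records) {
            setup();
            while (records.hasNext()) {
                inputRecords++;
                reduce(records.next());
            }
            cleanup();
        }
    }
    ```

    However long setup() takes in this sketch, the counter it observes is still zero, which matches what the stalled reducers show.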

    As for the 66%: the reduce phase actually consists of the following sub-phases:

    1. Copy
    2. Sort
    3. Reduce

    Each sub-phase counts for one third of the reported progress. Your first two steps were done and the third had started but was waiting for configure() to finish (reading the distributed cache), so the reduce percentage stayed at 66%.
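
    The arithmetic behind that figure can be sketched as follows. This is a simplified model rather than Hadoop's actual implementation: ReducePhaseProgress and overall() are hypothetical names, and the sketch simply assumes that the three sub-phases are weighted equally.

    ```java
    import java.util.Locale;

    // Simplified model (an assumption, not Hadoop source): three sub-phases, one third each.
    class ReducePhaseProgress {
        enum Phase { COPY, SORT, REDUCE }

        // fractionDone is the completed share of the current sub-phase, in [0, 1].
        static double overall(Phase current, double fractionDone) {
            if (fractionDone < 0.0 || fractionDone > 1.0) {
                throw new IllegalArgumentException("fractionDone must be in [0, 1]");
            }
            // Sub-phases already finished, plus the share of the current one.
            return (current.ordinal() + fractionDone) / Phase.values().length;
        }

        public static void main(String[] args) {
            // Copy and sort are done; the reduce sub-phase has not consumed a record yet.
            double stalled = overall(Phase.REDUCE, 0.0);
            System.out.println(String.format(Locale.ROOT, "%.2f%%", 100 * stalled));
        }
    }
    ```

    In this model a task that has finished copy and sort but is still inside configure() sits at two thirds, which is the value the stalled reducers report.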