hadoop, mapreduce, combiners

In-mapper combining: when does the mapper's cleanup routine execute?


I'm trying a simple bigram (word pair) count. I first used the simple "pairs" approach and have now modified it to try the "stripes" approach, but in the mapper's cleanup routine all of my keys are somehow the same word pair (the last word pair!), although the counts are preserved.

For example, the text input is:

My name is foo. Foo is new to Hadoop.

In the mapper, my hash map looks like:

((my, name), 1), ((name, is), 1), ((is, foo), 2), ((is, new), 1), ((new, to), 1), ((to, hadoop), 1)

But when I print the same hash map in the cleanup routine, it looks like:

((to, hadoop), 1), ((to, hadoop), 1), ((to, hadoop), 2), ((to, hadoop), 1), ((to, hadoop), 1), ((to, hadoop), 1)

My code looks like:

Map class:
private HashMap<TextPair, Integer> h = new HashMap<TextPair, Integer>();

void map(...) :
    ...
    StringTokenizer itr = new StringTokenizer(value.toString());
    left = itr.nextToken();
    while (itr.hasMoreTokens()) {
        right = itr.nextToken();

        if (left != null && right != null) {
            // BUG: `key` is a single TextPair that is mutated here, so every map entry
            // ends up referencing the same object. I have to create a new TextPair
            // (key object) each time!
            key.set(new Text(left.toLowerCase()), new Text(right.toLowerCase()));
            // If the key is already present, increment its count; otherwise add it with value 1
            if (h.containsKey(key)) {
                int total = h.get(key) + 1;
                h.put(key, total);
            } else {
                System.out.println("key: " + key.toString() + " => 1");
                h.put(key, 1);
            }
            // context.write(key, one);
        }
        left = right;
    }
    ....

void cleanup(...):
    Iterator<Entry<TextPair, Integer>> itr = h.entrySet().iterator();
    while (itr.hasNext()) {
        Entry<TextPair, Integer> entry = itr.next();
        TextPair key = entry.getKey();
        int total = entry.getValue().intValue();
        System.out.println("--- MAP CLEANUP ---: key: " + key.toString() + " => Total: " + total);

        context.write(key, new IntWritable(total));
    }
    ...

Note: TextPair is my custom key class. Any suggestions?

EDIT 1:

Is the mapper's cleanup routine executed last, after all map tasks are done? The hash map is kind of "global"; is something wrong with that, or with my iterator?

EDIT 2:

I have to create a new TextPair key object on each iteration in map() before putting it in the hash map; that is what the issue was. It's solved, but I'm wondering why. I have used hashes in Python many times without any trouble, so I don't understand why I need to create a new object each time.


Solution

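  • It seems that you don't create a new key each time but reuse a single TextPair object. A HashMap stores references to its keys, not copies, so every entry points at that one object, and each call to key.set(...) changes what all of them show. That is why the counts have the same distribution in both printouts, while the last key of the first set, (to, hadoop), appears everywhere in the second set.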
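
The effect can be reproduced without Hadoop. The following is only a minimal sketch: TextPair's source is not shown in the question, so MutablePair, KeyReuseDemo, countReusingKey and countWithFreshKeys are hypothetical names invented for illustration, and MutablePair is assumed to behave like TextPair (a mutable pair with value-based equals and hashCode).

```java
import java.util.HashMap;
import java.util.Map;

public class KeyReuseDemo {

    /** Hypothetical stand-in for the asker's TextPair: a mutable pair of strings. */
    static final class MutablePair {
        private String left;
        private String right;

        void set(String left, String right) {
            this.left = left;
            this.right = right;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof MutablePair)) return false;
            MutablePair p = (MutablePair) o;
            return left.equals(p.left) && right.equals(p.right);
        }

        @Override
        public int hashCode() {
            return 31 * left.hashCode() + right.hashCode();
        }

        @Override
        public String toString() {
            return "(" + left + ", " + right + ")";
        }
    }

    /** Buggy variant: one key object is mutated and put into the map for every bigram. */
    static Map<MutablePair, Integer> countReusingKey(String[] words) {
        Map<MutablePair, Integer> counts = new HashMap<>();
        MutablePair key = new MutablePair();
        for (int i = 0; i + 1 < words.length; i++) {
            // Mutating the shared object also changes every key already stored in the map.
            key.set(words[i], words[i + 1]);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    /** Fixed variant: each bigram gets its own key object, so stored keys never change. */
    static Map<MutablePair, Integer> countWithFreshKeys(String[] words) {
        Map<MutablePair, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < words.length; i++) {
            MutablePair key = new MutablePair();
            key.set(words[i], words[i + 1]);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] words = {"my", "name", "is", "foo"};
        System.out.println("reused key: " + countReusingKey(words));
        System.out.println("fresh keys: " + countWithFreshKeys(words));
    }
}
```

In the buggy variant the map still holds three entries with the right counts, because each entry was filed under the hash the key had at insertion time, but all three print as the last pair. Python rarely shows this problem because the usual dictionary keys there (strings, tuples) are immutable. As for the timing question, cleanup() is called once per map task, after the last map() call of that task, so a map held in a field of the mapper is the normal place to accumulate in-mapper counts.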