I have a series of map-reduce jobs that process user data (implemented using the Cascading framework), and I would like to track lots of fine-grained statistics. I can have between 100 and 1,000 users and about 20 statistics per user, so possibly between 2,000 and 20,000 statistics in total. I wanted to use map-reduce counters to build those stats because they are very convenient to use in the code, but there is a limit on the number of map-reduce counters (120 by default), and according to this post: http://developer.yahoo.com/blogs/hadoop/posts/2010/08/apache_hadoop_best_practices_a/ I should not use them if I have more than 20-50 custom counters.
Question: is there a proper way to track my statistics in this map-reduce context, using a counter-like pattern? By counter-like, I mean having access to the counters everywhere in my code and being able to increment them where needed.
Thanks in advance.
If your statistics are just counts that are only incremented during the parallel stage, you can collect them separately in each task instance and then add them up together afterwards (a reduce). This is the whole idea of MapReduce, actually.
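As a minimal sketch of that pattern in plain Java (the class and key names here, `StatCounters` and keys like `"user42.clicks"`, are illustrative, not part of any Hadoop or Cascading API): each task keeps its own in-memory counter map that you can increment from anywhere in the task's code, and the per-task maps are summed into one global map at the end. In a real job you would emit the per-task map as output (or serialize it in the task's cleanup) and do the summing in a reducer.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Counter-like pattern without Hadoop counters: one local map per task,
// merged (reduced) into a global map after the parallel stage.
public class StatCounters {
    private final Map<String, Long> counts = new HashMap<>();

    // Increment anywhere in the task's code, like a MapReduce counter.
    public void increment(String key) {
        counts.merge(key, 1L, Long::sum);
    }

    public Map<String, Long> snapshot() {
        return counts;
    }

    // The "reduce" step: sum all per-task maps into one total map.
    public static Map<String, Long> mergeAll(List<Map<String, Long>> perTask) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> m : perTask) {
            m.forEach((k, v) -> total.merge(k, v, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Simulate two parallel tasks, each with its own counters.
        StatCounters task1 = new StatCounters();
        StatCounters task2 = new StatCounters();
        task1.increment("user42.clicks");
        task1.increment("user42.clicks");
        task2.increment("user42.clicks");
        task2.increment("user7.logins");

        Map<String, Long> total =
            mergeAll(List.of(task1.snapshot(), task2.snapshot()));
        System.out.println(total.get("user42.clicks")); // 3
        System.out.println(total.get("user7.logins"));  // 1
    }
}
```

Because counter maps merge associatively, this scales to as many distinct keys as you like; you trade the Hadoop UI's live counter display for an ordinary (and unlimited) part of your job's output.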