Tags: performance, hadoop, mapreduce, bigdata, metrics

Which metrics should be used to measure the efficiency of a MapReduce application?


I wrote a MapReduce application which runs on 6 computer nodes. I am sure that my MapReduce algorithm (run on a cluster of computers) outperforms the sequential algorithm (run on a single computer), but I think this does not mean that my MapReduce algorithm is efficient enough, right?

I have searched around and found the speedup, scaleup, and sizeup metrics. Is it true that these are the metrics we normally consider when measuring the efficiency of a MapReduce application? Are there any other metrics we need to consider?

Thank you a lot.


Solution

  • Before specifically addressing your question, let's revisit the map-reduce model and see what real problem it tries to solve. You can refer to this answer (by me; of course you can also refer to other good answers to the same problem) to get an idea of the map-reduce model.

    So what does it really try to solve? It provides a generic model that can be applied to solve a vast range of problems that need to process a massive amount of data (usually gigabytes or even petabytes). The real strength of this model is that it can be easily parallelized, and its execution can even be easily distributed among a number of nodes. This article (by me) has a detailed explanation of the whole model.
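
    For concreteness, below is a minimal word-count style sketch using the Hadoop Java API (class and field names are just illustrative). The map phase runs independently on each input split and emits intermediate (key, value) pairs, the framework groups the pairs by key, and the reduce phase aggregates each group. That split-then-aggregate shape is exactly what makes the model easy to parallelize and distribute.

    ```java
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map: runs independently on each input split, emitting (word, 1).
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(line.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: receives all counts for one word (already grouped by the framework)
        // and sums them.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) {
                    sum += c.get();
                }
                context.write(word, new IntWritable(sum));
            }
        }
    }
    ```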

    So let's get to your question: you are asking about measuring the efficiency of a map-reduce program in terms of speed, memory efficiency, and scalability.
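
    Just so we are on the same page about the three metrics you named, they are usually computed from measured wall-clock times of the same job under different conditions. A rough sketch (plain Java; the sample timings in main are placeholders, not real measurements):

    ```java
    public final class MapReduceMetrics {

        // Speedup: same data size, 1 node versus n nodes. Ideal value is n (linear speedup).
        static double speedup(double timeOnOneNode, double timeOnNNodes) {
            return timeOnOneNode / timeOnNNodes;
        }

        // Scaleup: data size and cluster size grow by the same factor.
        // Ideal value is 1.0 (the bigger job on the bigger cluster takes the same time).
        static double scaleup(double timeSmallDataSmallCluster, double timeLargeDataLargeCluster) {
            return timeSmallDataSmallCluster / timeLargeDataLargeCluster;
        }

        // Sizeup: same cluster, data grows by a factor m.
        // Ideal value is at most m (running time grows no faster than the data).
        static double sizeup(double timeSmallData, double timeLargeData) {
            return timeLargeData / timeSmallData;
        }

        public static void main(String[] args) {
            // Placeholder timings in seconds; substitute your own job measurements.
            System.out.println("speedup = " + speedup(600.0, 120.0));
            System.out.println("scaleup = " + scaleup(120.0, 150.0));
            System.out.println("sizeup  = " + sizeup(120.0, 300.0));
        }
    }
    ```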

    To the point: the efficiency of a map-reduce program always depends on how far it exploits the parallelism offered by the underlying computational power. This directly implies that a map-reduce program that runs well on one cluster may not be the ideal program to run on a different cluster. So we need to have a good idea of our cluster if we hope to fine-tune our program to that level of precision. But practically, it is rare that someone needs to tune it up to that level.
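
    As a hedged illustration of what "having a good idea of our cluster" means in practice, these are the kinds of knobs that get adjusted per cluster; the concrete numbers below are hypothetical, not recommendations:

    ```java
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class JobTuning {
        public static Job configure(Configuration conf) throws IOException {
            Job job = Job.getInstance(conf, "tuned-job");

            // Cap the input split size so the number of map tasks roughly matches
            // the containers available on this particular cluster (128 MB is hypothetical).
            FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

            // Pick a reducer count that keeps every node busy; a common rule of thumb is
            // a small multiple of the number of worker nodes (12 here is again hypothetical).
            job.setNumReduceTasks(12);

            return job;
        }
    }
    ```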

    Let's take your points one by one:

    • Speed-up: It depends on how you split your input into different portions. This directly determines the amount of parallelism (within human control). So, as I mentioned above, the speed-up directly depends on how well your split logic is able to utilize your cluster (the tuning sketch above shows the split-size knob involved).

    • Memory efficiency: This mostly depends on how memory-efficient your mapper logic and reducer logic are (see the sketch after this list for a couple of common tricks).

    • Scalability: This is mostly not a concern. You can see that the map-reduce model is already scalable to a level where one would rarely need to go the extra mile.
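
    To make the speed and memory points a bit more concrete, here is a hedged sketch of two common habits: the reducer streams over the grouped values instead of buffering them, and it reuses a single output Writable rather than allocating one per record (the class name is illustrative):

    ```java
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Streams over the values of one key; nothing is buffered in memory,
    // and the output Writable is reused instead of being allocated per record.
    public class StreamingSumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        private final LongWritable result = new LongWritable();

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
    ```

    Because summation is associative, the same class can also be registered as a combiner with `job.setCombinerClass(StreamingSumReducer.class)`, so partial sums are computed on the map side and much less data has to be shuffled across the network, which is usually the biggest practical speed win.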

    So, speaking as a whole, the efficiency of a map-reduce program is rarely a concern (even speed and memory). Practically speaking, the most valuable metric is the quality of its output, i.e. how good your analytic data are (for marketing, research, etc.).