Tags: hadoop, mapreduce, distributed-computing

Why does a 3-node cluster have worse performance than a single-node cluster?


I ran multiple tests with multiple files (the biggest file is 83.7 MB).

I know the network adds some overhead, but I was expecting better results, since I thought the whole point of using a distributed system is to reduce response time.

I measure performance with /usr/bin/time. What is the problem here?


Solution

  • If your MapReduce key sends all of the data to a single node in the cluster, then you get no performance improvement over a single node, and you add the network overhead of the shuffle on top; the partitioner sketch after this list shows why one key maps to one reducer.

  • If you haven't tuned your MapReduce YARN container sizes for your hardware, you'll see poor performance; see the configuration sketch below.

  • If you are storing lots of files smaller than the HDFS block size (128 MB, if you've left the default), then, as you mentioned, you're wasting resources. Additionally, if you are processing a single large file such as a ZIP or another "non-splittable" format, you get no benefit over a single mapper node; the split arithmetic below makes this concrete.
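
To make the first point concrete, here is essentially the logic in Hadoop's stock HashPartitioner (a sketch of the default behavior, not a custom partitioner): the partition is a pure function of the key, so every record sharing a key lands on the same reduce task, and one hot key funnels all of its data through a single node no matter how many nodes you add.

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of Hadoop's default HashPartitioner logic: all records with the
// same key go to the same reduce task -- and therefore to a single node.
public class HashPartitionerSketch<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // Clear the sign bit, then bucket by the reducer count.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
```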
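For container tuning, the usual knobs are the per-task memory sizes. A minimal sketch, assuming roughly an 8 GB worker node; the specific megabyte values are illustrative assumptions, not recommendations:

```java
import org.apache.hadoop.conf.Configuration;

public class ContainerSizingSketch {
  public static Configuration tunedConf() {
    Configuration conf = new Configuration();
    // Assumed sizes for an ~8 GB worker; tune these to your actual hardware.
    conf.set("mapreduce.map.memory.mb", "2048");       // YARN container per map task
    conf.set("mapreduce.map.java.opts", "-Xmx1638m");  // JVM heap ~80% of the container
    conf.set("mapreduce.reduce.memory.mb", "4096");    // reducers typically get more
    conf.set("mapreduce.reduce.java.opts", "-Xmx3277m");
    return conf;
  }
}
```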
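And for the block-size point, the arithmetic for the 83.7 MB file from the question, assuming the default 128 MB block size:

```java
public class SplitCountSketch {
  public static void main(String[] args) {
    long blockSize = 128L * 1024 * 1024;          // default dfs.blocksize
    long fileSize  = (long) (83.7 * 1024 * 1024); // the biggest file in the question

    // A splittable format yields roughly ceil(fileSize / blockSize) map tasks.
    long mapTasks = (fileSize + blockSize - 1) / blockSize;
    System.out.println("map tasks: " + mapTasks); // prints 1 -- no map-side parallelism

    // A non-splittable format (e.g. gzip) always yields exactly 1 map task,
    // however large the file, so extra nodes cannot speed up the map phase.
  }
}
```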

    I measure performance with /usr/bin/time

The MapReduce job output and the History Server both tell you how long a job and its tasks actually take. /usr/bin/time measures the whole client process, including JVM startup and job submission, so prefer the framework's own numbers; a sketch of reading them programmatically follows.
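
A minimal sketch, assuming a standard Hadoop 2.x client; the job name is a placeholder and the mapper/reducer/path wiring is elided:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobStatus;

public class JobTimingSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "timing-sketch");
    // ... set mapper, reducer, and input/output paths here as usual ...

    job.waitForCompletion(true); // verbose=true also prints per-phase counters

    // Wall-clock time as the framework measured it, excluding the client-side
    // overhead (JVM startup, submission, polling) that /usr/bin/time includes.
    JobStatus status = job.getStatus();
    long millis = status.getFinishTime() - status.getStartTime();
    System.out.println("Job ran for " + millis + " ms");
  }
}
```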