apache-spark, hadoop, mapreduce, hdfs, distributed-computing

Hadoop/Spark: How are replication factor and performance related?


Setting aside all other performance factors, disk space, and namenode object counts, how can the replication factor improve the performance of MR, Tez, and Spark?

If we have, for example, 5 datanodes, is it better for the execution engine to set the replication factor to 5? What are the best and worst values?

How can this help aggregations, joins, and map-only jobs?


Solution

  • One of the major tenets of Hadoop is moving the computation to the data.

    If you set the replication factor approximately equal to the number of datanodes, you're guaranteed that every machine has a local copy and can process that data without pulling it over the network (see the first sketch below).

    However, as you mention, namenode overhead is very important, and more files or replicas cause slower requests. Extra replicas can also saturate the network in an unhealthy cluster. I've never seen anything higher than 5, and that was only for the company's most critical data. Everything else was left at 2 replicas.

    The execution engine doesn't matter too much, other than Tez and Spark outperforming MR in most cases. What matters more is the size of your files and the format they are stored in; that will be a major driver of execution performance (see the second sketch below).
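
A minimal sketch of raising replication for one hot dataset rather than cluster-wide, using Hadoop's `FileSystem.setReplication` API. The path `/warehouse/dim_customers` and the factor of 5 (mirroring the 5-datanode example in the question) are placeholders for illustration, not values from the answer.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object IncreaseReplication {
  def main(args: Array[String]): Unit = {
    // Hypothetical path to a small, frequently joined dataset;
    // adjust to your own HDFS layout.
    val hotData = new Path("/warehouse/dim_customers")

    val conf = new Configuration()    // picks up core-site.xml / hdfs-site.xml
    val fs   = FileSystem.get(conf)

    // Raise the replication factor so more datanodes hold a local copy,
    // giving the scheduler more options for data-local tasks.
    // setReplication applies per file, so walk the directory recursively.
    val files = fs.listFiles(hotData, /* recursive = */ true)
    while (files.hasNext) {
      val status = files.next()
      fs.setReplication(status.getPath, 5.toShort)
    }

    fs.close()
  }
}
```

Keeping this per-dataset limits the namenode and network cost to the data that actually benefits from the extra locality, instead of paying it for the whole cluster.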
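And a companion sketch for the file-size and file-format point: consolidating many small input files into a few larger Parquet files with Spark. The input and output paths and the partition count of 8 are assumptions; the right count depends on your data volume and block size.

```scala
import org.apache.spark.sql.SparkSession

object WriteCompactParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("write-compact-parquet")
      .getOrCreate()

    // Hypothetical input: many small CSV files that inflate the namenode's
    // object count and produce tiny, inefficient splits.
    val raw = spark.read
      .option("header", "true")
      .csv("/data/raw/events")

    // Consolidate into a handful of larger files in a splittable columnar
    // format; both choices usually matter more for scan performance than
    // the HDFS replication factor does.
    raw.repartition(8)
      .write
      .mode("overwrite")
      .parquet("/data/curated/events")

    spark.stop()
  }
}
```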