I am reading a Hadoop tutorial by Yahoo at https://developer.yahoo.com/hadoop/tutorial/module2.html, and it mentions a "local DataNode". I would like to know what exactly a local DataNode is. My guess is that it is a machine that is a NameNode and a DataNode at the same time, but I would like confirmation of what it really is.
In Hadoop, by default, each block of data is stored 3 times (a replication factor of 3).
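To ensure the availability and durability of data, Hadoop places the replicas on 3 different Data Nodes:
Local Data Node: the Data Node from which the write was initiated (for example, with the hadoop fs -cp command). The first replica is placed here. If the client is writing the data from outside the cluster, then this node is chosen at random. Either way, it is the node on which the first replica gets written.
Data Node on a different rack: the second replica is placed here. This ensures that, even if one rack goes down, the data is still available on a Data Node present in another rack.
Another Data Node on that same remote rack: under the default placement policy, the third replica is placed here.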
So in this tutorial, "local DataNode" means the Data Node from which the write operation was initiated.
Let's take an example. Assume that you are trying to copy a file a.txt into HDFS, and that the cluster has 3 racks and is rack-aware:
Rack 1: Node 1, Node 2
Rack 2: Node 3, Node 4
Rack 3: Node 5, Node 6
You also have another node, Node 7, which is outside the Hadoop cluster but is connected to it, so you can perform HDFS operations from it.
Case 1: Client inside the cluster
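Let's assume that you execute hadoop fs -copyFromLocal a.txt /tmp/ from Node 1 (which is on Rack 1). Then Hadoop will try to place the replicas as follows:
First replica: Node 1 (the local Data Node)
Second replica: a Data Node on a different rack, for example Node 3 on Rack 2
Third replica: another Data Node on the same rack as the second replica, for example Node 4 on Rack 2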
Case 2: Client outside the cluster
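Let's assume that you execute hadoop fs -copyFromLocal a.txt /tmp/ from Node 7 (which is not part of the cluster; only the client runs on it). Then Hadoop will try to place the replicas as follows:
First replica: a Data Node chosen at random, for example Node 2 on Rack 1
Second replica: a Data Node on a different rack from the first replica, for example Node 5 on Rack 3
Third replica: another Data Node on the same rack as the second replica, for example Node 6 on Rack 3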
This is how replica placement should ideally happen. In practice, the actual placement also depends on the space available on the different racks and nodes.
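To make the two cases concrete, here is a small Python sketch that simulates the placement for the example cluster. It is only an illustration, not Hadoop's actual code: it assumes the simplified policy of "first replica on the local Data Node (or a random one if the client is outside the cluster), second replica on a different rack, third replica on another node of that same remote rack", and it ignores free space and node load. The names TOPOLOGY, rack_of and place_replicas are made up for this example.

```python
import random

# Hypothetical topology taken from the example: rack name -> Data Nodes.
TOPOLOGY = {
    "Rack 1": ["Node 1", "Node 2"],
    "Rack 2": ["Node 3", "Node 4"],
    "Rack 3": ["Node 5", "Node 6"],
}


def rack_of(node):
    """Return the rack holding this Data Node, or None if it is outside the cluster."""
    for rack, nodes in TOPOLOGY.items():
        if node in nodes:
            return rack
    return None


def place_replicas(client_node, rng=random):
    """Pick three Data Nodes for one block under the simplified policy (assumption)."""
    all_nodes = [n for nodes in TOPOLOGY.values() for n in nodes]
    # First replica: the local Data Node if the client runs on one,
    # otherwise a Data Node chosen at random.
    first = client_node if rack_of(client_node) else rng.choice(all_nodes)
    # Second replica: a Data Node on a rack different from the first replica's rack.
    remote_rack = rng.choice([r for r in TOPOLOGY if r != rack_of(first)])
    second = rng.choice(TOPOLOGY[remote_rack])
    # Third replica: a different Data Node on the same remote rack as the second.
    third = rng.choice([n for n in TOPOLOGY[remote_rack] if n != second])
    return [first, second, third]


if __name__ == "__main__":
    print("Case 1 (client on Node 1):", place_replicas("Node 1"))
    print("Case 2 (client on Node 7):", place_replicas("Node 7"))
```

In Case 1 the first entry is always Node 1; in Case 2 it varies from run to run, because Node 7 is not a Data Node. In both cases the second and third replicas share a rack that differs from the first replica's rack.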