hadoop hadoop2 bigdata

What is a local DataNode?


I am reading a Hadoop tutorial module by Yahoo at https://developer.yahoo.com/hadoop/tutorial/module2.html, and it mentions a "local DataNode". I would like to know what exactly a local DataNode is. My guess is that it is a machine that is both a NameNode and a DataNode at the same time, but I want confirmation of what it really is.


Solution

  • In Hadoop, by default, each block of data is stored as 3 replicas (replication factor of 3).

    To ensure the availability and durability of data, Hadoop places the replicas on 3 different Data Nodes:

    • Local Data Node: The Data Node from which the client initiates a write (e.g. using the hadoop fs -cp command). The first replica is placed here. If the client is writing the data from outside the cluster, this node is chosen at random.
    • Off-Rack Data Node: A Data Node on a different rack from the first one. The second replica is placed here.
    • On-Rack Data Node: A Data Node that is physically on the same rack as the first Data Node. The third replica is placed here.

    This ensures that, even if one rack goes down, the data is still available on a Data Node present in another rack.

    So in this tutorial, "local Data Node" means the Data Node from which the write operation was initiated. It does not need to be a NameNode as well.
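
The rack-failure argument above can be checked with a small sketch. The rack and node names are hypothetical, and the function is purely illustrative rather than anything Hadoop provides: it takes the rack of each replica and reports whether the block stays readable after any single rack is lost.

```python
from typing import Dict


def survives_single_rack_failure(replica_racks: Dict[str, str]) -> bool:
    """Return True if at least one replica remains after any one rack fails.

    replica_racks maps each node holding a replica to its rack
    (hypothetical names, for illustration only).
    """
    racks = set(replica_racks.values())
    # Losing a rack removes every replica stored on it, so the block
    # survives only if some replica lives on another rack.
    for failed in racks:
        remaining = [n for n, r in replica_racks.items() if r != failed]
        if not remaining:
            return False
    return True


# Placement as described above: local node, off-rack node, on-rack node.
spread = {"node1": "rack1", "node3": "rack2", "node2": "rack1"}
# Counter-example: all three replicas on the same rack.
packed = {"node1": "rack1", "node2": "rack1", "node9": "rack1"}

print(survives_single_rack_failure(spread))  # True
print(survives_single_rack_failure(packed))  # False
```

Spreading the replicas over two racks is enough to pass this check, which is why only one of the three replicas has to leave the local rack.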

    Let's take an example. Suppose you are copying a file a.txt into HDFS, and the cluster has 3 racks and is rack-aware:

    Rack 1: Node 1, Node 2
    Rack 2: Node 3, Node 4
    Rack 3: Node 5, Node 6
    
    Also, you have another node, Node 7, which is outside the Hadoop cluster but is
    connected to it, so you can perform HDFS operations from it.
    

    Case 1: Client inside the cluster

    Let's assume that you execute hadoop fs -copyFromLocal a.txt /tmp/ from Node 1 (which is on Rack 1). Then Hadoop will try to place the replicas as follows:

    • The first replica is placed on Node 1. This is the Local Data Node for the client.
    • The second replica is placed on either Rack 2 (Node 3 or Node 4) or Rack 3 (Node 5 or Node 6). This is the Off-Rack Data Node.
    • The third replica is placed on Node 2. This is the On-Rack Data Node.

    Case 2: Client outside the cluster

    Let's assume that you execute hadoop fs -copyFromLocal a.txt /tmp/ from Node 7 (which is not part of the cluster; only the client runs on it). Then Hadoop will try to place the replicas as follows:

    • It randomly picks one of the nodes (any of Node 1 to Node 6), and this node becomes the Local Data Node. Let's assume it picks Node 6, which is on Rack 3.
    • The second replica is then placed on either Rack 1 (Node 1 or Node 2) or Rack 2 (Node 3 or Node 4). This is the Off-Rack Data Node.
    • The third replica is placed on Node 5. This is the On-Rack Data Node.
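
Both cases can be reproduced with a short simulation. This is a toy model of the rule as stated in this answer, not HDFS's actual block placement code; the names `TOPOLOGY`, `rack_of` and `place_replicas` are made up for the sketch, and the random generator is seeded only so that repeated runs behave the same.

```python
import random
from typing import Dict, List, Optional

# Topology from the example: rack -> DataNodes. Node 7 is a client only.
TOPOLOGY: Dict[str, List[str]] = {
    "Rack 1": ["Node 1", "Node 2"],
    "Rack 2": ["Node 3", "Node 4"],
    "Rack 3": ["Node 5", "Node 6"],
}


def rack_of(node: str) -> Optional[str]:
    """Return the rack holding `node`, or None if it is not a DataNode."""
    for rack, nodes in TOPOLOGY.items():
        if node in nodes:
            return rack
    return None


def place_replicas(client: str, rng: random.Random) -> List[str]:
    """Pick three DataNodes following the rule described in this answer."""
    all_nodes = [n for nodes in TOPOLOGY.values() for n in nodes]
    # First replica: the client's own node if it is a DataNode, else random.
    first = client if rack_of(client) else rng.choice(all_nodes)
    local_rack = rack_of(first)
    # Second replica: any node on a different rack (off-rack).
    off_rack = [n for n in all_nodes if rack_of(n) != local_rack]
    second = rng.choice(off_rack)
    # Third replica: another node on the first replica's rack (on-rack).
    on_rack = [n for n in TOPOLOGY[local_rack] if n != first]
    third = rng.choice(on_rack)
    return [first, second, third]


rng = random.Random(0)  # seeded only to make the sketch repeatable
case1 = place_replicas("Node 1", rng)  # Case 1: client inside the cluster
case2 = place_replicas("Node 7", rng)  # Case 2: client outside the cluster
print(case1)
print(case2)
```

In Case 1 the first and third entries are always Node 1 and Node 2; in Case 2 the first entry varies, but the third is always its rack neighbour and the second always sits on another rack.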

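    This is the ideal replica placement. In practice, it also depends on the space available on the different racks and nodes. Note that the current Apache HDFS architecture documentation describes the third replica as going to a different node on the same remote rack as the second replica, rather than to the first replica's rack.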
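
The dependence on free space can be pictured as a filter applied before a node is chosen. The free-space figures, the block size and the fallback order below are assumptions made for illustration; the checks HDFS really performs are more involved.

```python
from typing import Dict, List

BLOCK_SIZE_MB = 128  # assumed block size for this sketch


def eligible_nodes(free_mb: Dict[str, int], candidates: List[str]) -> List[str]:
    """Keep only candidates with room for at least one more block."""
    return [n for n in candidates if free_mb.get(n, 0) >= BLOCK_SIZE_MB]


def pick_with_fallback(free_mb: Dict[str, int], preferred: List[str],
                       others: List[str]) -> str:
    """Return the first preferred node with space, else fall back to others."""
    for group in (preferred, others):
        usable = eligible_nodes(free_mb, group)
        if usable:
            return usable[0]
    raise RuntimeError("no DataNode has enough free space")


# Hypothetical free space (MB): Node 2 is nearly full, so the on-rack
# choice for a writer on Node 1 cannot be honored.
free = {"Node 2": 10, "Node 3": 5000, "Node 4": 900}
chosen = pick_with_fallback(free, preferred=["Node 2"], others=["Node 3", "Node 4"])
print(chosen)  # Node 3
```

Here the preferred on-rack node is skipped because it cannot hold another block, so the replica ends up somewhere other than the ideal location.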