Tags: hadoop, replication, hdfs, location-aware

HDFS' Location Awareness


Introduction

According to several sources (1, 2, 3), HDFS' Location Awareness is about knowing the physical location of nodes and replicating data across different racks to reduce the impact of rack-level failures, e.g. a failed power supply or switch.

Question

How does HDFS know the physical location of nodes and racks and subsequently decide to replicate data to nodes located on other racks?


Solution

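  • Rack awareness is configured when the cluster is set up. The mapping from nodes to racks can be supplied either manually for each node or through a script (in current Hadoop versions, the script is named in the net.topology.script.file.name property).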

    Each DataNode is given a network location, which is simply a string, much like a file system path.

    Example:

    datacenter-1/rack-1/node1
    datacenter-1/rack-1/node2
    datacenter-1/rack-2/node3
    

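    The NameNode then builds a network topology (essentially a tree) from the network locations of all DataNodes and uses it to decide block replica placement. With the default policy and a replication factor of 3, one replica goes on the writer's node (if the writer is a DataNode), the second on a node in a different rack, and the third on another node in that same remote rack.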
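
To make the idea concrete, here is a minimal Python sketch. It is not HDFS code and does not use any Hadoop API. It only illustrates how location strings like the ones above could be turned into a tree and then used to spread replicas across racks. The function names (`build_topology`, `rack_of`, `choose_replicas`) and the extra node `node4` are assumptions made for illustration. The placement rule is a simplification of HDFS's default behaviour for a replication factor of 3: one replica on the writer's node and two on different nodes of a single remote rack.

```python
import random

# Network locations as in the example above.
NETWORK_LOCATIONS = [
    "datacenter-1/rack-1/node1",
    "datacenter-1/rack-1/node2",
    "datacenter-1/rack-2/node3",
    "datacenter-1/rack-2/node4",  # hypothetical extra node, not in the original example
]


def build_topology(paths):
    """Build a tree of the form {datacenter: {rack: [nodes]}} from location strings."""
    tree = {}
    for path in paths:
        datacenter, rack, node = path.split("/")
        tree.setdefault(datacenter, {}).setdefault(rack, []).append(node)
    return tree


def rack_of(tree, node):
    """Return the (datacenter, rack) pair that contains the given node."""
    for datacenter, racks in tree.items():
        for rack, nodes in racks.items():
            if node in nodes:
                return (datacenter, rack)
    raise KeyError(node)


def choose_replicas(tree, writer, rng=random):
    """Pick three nodes: the writer plus two distinct nodes on one remote rack.

    Simplified illustration only; the real HDFS policy also considers
    node load, free space, and fallbacks when racks are too small.
    """
    local_rack = rack_of(tree, writer)
    remote_racks = [
        (datacenter, rack)
        for datacenter, racks in tree.items()
        for rack, nodes in racks.items()
        if (datacenter, rack) != local_rack and len(nodes) >= 2
    ]
    if not remote_racks:
        raise ValueError("need a remote rack with at least two nodes")
    datacenter, rack = rng.choice(remote_racks)
    second, third = rng.sample(tree[datacenter][rack], 2)
    return [writer, second, third]


topology = build_topology(NETWORK_LOCATIONS)
replicas = choose_replicas(topology, "node1")
print(topology)
print(replicas)
```

With this layout, a block written from node1 always ends up with one copy on rack-1 and two copies on rack-2, so losing either rack still leaves at least one replica available.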