Introduction
According to several pieces of documentation (1, 2, 3), HDFS's Location Awareness is about knowing the physical location of nodes and replicating data across different racks, which reduces the impact of rack-level failures caused by, e.g., power supply or switch issues.
Question
How does HDFS know the physical location of nodes and racks and subsequently decide to replicate data to nodes located on other racks?
Rack-awareness is configured when the cluster is set up. This can be done either manually for each node or through a script.
Each DataNode is given a network location, which is simply a string, much like a file system path.
Example:
datacenter-1/rack-1/node1
datacenter-1/rack-1/node2
datacenter-1/rack-2/node3
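To make the script-based option more concrete, here is a minimal sketch of what such a topology script might look like. As far as I know, Hadoop calls the script configured under `net.topology.script.file.name` with one or more IP addresses or hostnames as arguments and reads one rack path per argument from standard output. The address-to-rack table below is purely hypothetical, and the leading `/` follows the convention Hadoop uses for rack paths (unknown hosts usually end up in `/default-rack`).

```python
#!/usr/bin/env python3
"""Sketch of an HDFS topology script (hypothetical host-to-rack table)."""
import sys

# Hypothetical mapping from DataNode address to rack path.
# In a real cluster this would come from your own inventory data.
RACK_BY_HOST = {
    "10.0.1.11": "/datacenter-1/rack-1",
    "10.0.1.12": "/datacenter-1/rack-1",
    "10.0.2.21": "/datacenter-1/rack-2",
}

# Fallback for hosts that are not in the table.
DEFAULT_RACK = "/default-rack"


def resolve(hosts):
    """Return one rack path per host, in the same order as the input."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Print the rack paths separated by spaces, one per argument.
    print(" ".join(resolve(sys.argv[1:])))
```

The script only has to answer "which rack is this host on?"; the node name itself is added by HDFS as the leaf below that rack.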
The NameNode then builds a network topology (basically a tree structure) using the network locations of each DataNode. This topology is then used to determine block replica placement.
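As a rough illustration of how the topology drives placement, the sketch below groups the example locations by rack and then picks three targets in the way the default policy is usually described: first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack. This is a simplified model, not the actual NameNode code; the function names `build_topology` and `choose_replicas` are made up for this example, and the fallback for a remote rack with only one node is my own simplification.

```python
import random
from collections import defaultdict


def build_topology(locations):
    """Group node names by rack path (a flattened view of the tree)."""
    racks = defaultdict(list)
    for location in locations:
        # Everything before the last "/" is the rack, the rest is the node.
        rack, _, node = location.rpartition("/")
        racks[rack].append(node)
    return dict(racks)


def choose_replicas(racks, writer_rack, writer_node, rng=random):
    """Pick three (rack, node) targets for one block (simplified policy)."""
    # 1st replica: the node the writer runs on.
    chosen = [(writer_rack, writer_node)]

    # 2nd replica: a random node on a different rack.
    remote_rack = rng.choice([r for r in racks if r != writer_rack])
    chosen.append((remote_rack, rng.choice(racks[remote_rack])))

    # 3rd replica: another node on the same remote rack, if there is one.
    candidates = [(remote_rack, n) for n in racks[remote_rack]
                  if (remote_rack, n) not in chosen]
    if not candidates:
        # Simplified fallback: any node that has not been chosen yet.
        candidates = [(r, n) for r, nodes in racks.items() for n in nodes
                      if (r, n) not in chosen]
    chosen.append(rng.choice(candidates))
    return chosen


locations = [
    "datacenter-1/rack-1/node1",
    "datacenter-1/rack-1/node2",
    "datacenter-1/rack-2/node3",
]
racks = build_topology(locations)
replicas = choose_replicas(racks, "datacenter-1/rack-1", "node1")
```

With the three example nodes and a writer on node1, the second replica has to go to node3 because it is the only node on another rack, so losing either rack still leaves at least one copy of the block.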