Hadoop rack topology

In Hadoop I've read that the rack topology can be configured by supplying IP addresses of the racks or their associated host names. Does that mean that within one Hadoop cluster you could theoretically have different racks in completely separate geographical locations as long as they are reachable (can be pinged) from the NameNode?

If that is the case I would assume the replication strategy of the blocks defined by the rack awareness algorithm would be the same.

Solution

By default, Hadoop treats all nodes as belonging to a single rack named /default-rack. If the cluster has multiple racks, whether within one datacenter or spanning several, the Hadoop components (especially the NameNode) must be made aware of them.

In Hadoop I've read that the rack topology can be configured by supplying IP addresses of the racks or their associated host names.

Yes. Configuring rack topology requires a script that maps each DataNode's IP address or hostname to a single rack.
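As a rough illustration, such a script could look like the sketch below. Hadoop runs the script with one or more IP addresses or hostnames as arguments and reads one rack path per argument from standard output. The addresses, hostnames, and rack names here are invented for the example, and the script would normally be registered through the net.topology.script.file.name property in core-site.xml (check the property name against your Hadoop version).

```python
#!/usr/bin/env python3
"""Hypothetical rack topology script; all hosts and rack names are made up."""
import sys

DEFAULT_RACK = "/default-rack"

# Assumed mapping from DataNode IP or hostname to its rack path.
RACK_BY_HOST = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
    "datanode5.example.com": "/dc2/rack1",
}


def resolve_racks(hosts):
    """Return one rack path per input, falling back to the default rack."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Print one rack path per argument, in the same order as the arguments.
    print("\n".join(resolve_racks(sys.argv[1:])))
```

Any host missing from the table falls back to /default-rack here, so a partially filled mapping still returns an answer for every argument.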

Does that mean that within one Hadoop cluster you could theoretically have different racks in completely separate geographical locations as long as they are reachable (can be pinged) from the NameNode?

Yes, provided they are reachable by all nodes in the cluster, not just the NameNode. As a best practice, however, spreading nodes across different geographical locations is not recommended, because it can increase network latency between the nodes.

If that is the case I would assume the replication strategy of the blocks defined by the rack awareness algorithm would be the same.

Yes. The block placement policy is the same regardless of the rack topology.
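To make that concrete, here is a simplified sketch of the default HDFS placement for a replication factor of 3 as the HDFS documentation describes it: the first replica goes on the writer's node (or a random node if the writer is not a DataNode), the second on a node in a different rack, and the third on a different node in the same rack as the second. The function name, node names, and rack paths are hypothetical. The sketch assumes at least two racks and at least two nodes in the remote rack, and it omits the fallbacks and load checks of the real implementation.

```python
import random


def place_replicas(writer, rack_of, rng=random):
    """Pick three target nodes following the simplified default policy.

    rack_of maps each DataNode name to its rack path (illustrative data).
    """
    nodes = sorted(rack_of)
    # 1st replica: the writer's own node if it is a DataNode, else a random node.
    first = writer if writer in rack_of else rng.choice(nodes)
    # 2nd replica: any node on a different rack than the first replica.
    remote = [n for n in nodes if rack_of[n] != rack_of[first]]
    second = rng.choice(remote)
    # 3rd replica: a different node on the same rack as the second replica.
    same_rack = [n for n in nodes
                 if rack_of[n] == rack_of[second] and n != second]
    third = rng.choice(same_rack)
    return [first, second, third]


if __name__ == "__main__":
    example = {
        "dn1": "/dc1/rack1", "dn2": "/dc1/rack1",
        "dn3": "/dc2/rack1", "dn4": "/dc2/rack1",
    }
    print(place_replicas("dn1", example))
```

The same rule applies whether the second rack is in the next row or in another datacenter. The policy only sees rack paths, which is why geographically distant racks add latency to every write without changing where replicas land.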