
Can a hadoop client leverage the benefit of rack awareness?


I have 10 ingestion machines that use Akka Streams for data ingestion. I also have a Hadoop cluster of 50 nodes that runs pipelines using Spark Streaming. The Hadoop cluster uses the data generated by those 10 machines to produce reports. Can I leverage rack awareness from those 10 machines without adding them to the Hadoop cluster?

By rack awareness, I mean this: if those machines are in the same racks as the Hadoop datanodes, I would want each ingestion machine to upload data to its nearest datanode rather than to a random one, so that there is less network traffic.

Please let me know if that's possible.


Solution

  • If I understood your setup correctly, this should happen automagically. According to the HDFS Architecture documentation:

    For the common case, when the replication factor is three, HDFS’s placement policy is to put one replica on the local machine if the writer is on a datanode, otherwise on a random datanode in the same rack as that of the writer, another replica on a node in a different (remote) rack, and the last on a different node in the same remote rack.

    (The part relevant to your case, if your ingest nodes are not cluster datanodes, is "otherwise on a random datanode in the same rack as that of the writer".)
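
To make the quoted policy concrete, here is a minimal toy model of it in Python. It is only a sketch of the rule as worded in the quote, not Hadoop's actual placement code. The function name `choose_replicas`, the `topology` dictionary, and the node and rack names are all hypothetical, and it assumes the cluster already knows which rack the writer sits in, that the writer's rack contains at least one datanode, and that at least one other rack has two or more datanodes.

```python
import random


def choose_replicas(writer, writer_rack, topology, rng=random):
    """Pick three replica targets following the quoted rule (toy model).

    topology maps a rack name to the list of datanode names in that rack.
    writer is the host doing the write; it may or may not be a datanode.
    """
    local_nodes = topology[writer_rack]

    # Replica 1: the writer itself if it is a datanode,
    # otherwise a random datanode in the writer's rack.
    first = writer if writer in local_nodes else rng.choice(local_nodes)

    # Replica 2: a datanode on a different (remote) rack.
    # Only racks with at least two datanodes can also host replica 3.
    remote_racks = sorted(
        rack for rack, nodes in topology.items()
        if rack != writer_rack and len(nodes) >= 2
    )
    remote_rack = rng.choice(remote_racks)
    second = rng.choice(topology[remote_rack])

    # Replica 3: a different datanode on that same remote rack.
    third = rng.choice([n for n in topology[remote_rack] if n != second])

    return [first, second, third]


# Hypothetical layout: three racks, two datanodes each.
topology = {
    "rack1": ["dn1", "dn2"],
    "rack2": ["dn3", "dn4"],
    "rack3": ["dn5", "dn6"],
}

# An ingestion host that is not a datanode but sits in rack1.
print(choose_replicas("ingest1", "rack1", topology))
```

In this model the first replica for `ingest1` always lands on `dn1` or `dn2`, which is the behaviour the question asks for: the first copy stays inside the writer's rack, and only the other two replicas cross racks.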