hadoop, azure-hdinsight, azure-compute-emulator

How does an Azure-based Hadoop cluster (HDInsight) translate to a classic on-premises Hadoop cluster?


Apache Hadoop is designed to run across a group of commodity machines (nodes). It was not designed for complex cloud-based scenarios. But because the cloud can simulate individual nodes with VMs, cloud-based Hadoop clusters emerged. That makes things hard for me to understand. Every standard explanation of a Hadoop cluster I study describes an on-prem architecture, because Hadoop's architecture is explained with a simple, logical on-prem view in mind. That makes it difficult to see how a cloud-based cluster works, especially concepts such as HDFS and data locality. In the on-prem explanation, every node has its own 'local' storage (which also implies that the storage hardware is fixed to a specific node and doesn't get shuffled around), and it is not assumed that a node is ever deleted. We also treat that storage as part of the node itself, so we never think of killing a node and retaining its storage for later use.

Now, in the cloud-based Hadoop (HDInsight) model, we can attach any Azure Storage account as primary storage for the cluster. So if we have a cluster with 4 worker nodes and 2 head nodes, does that single Azure Storage account act as the HDFS space for all 6 virtual machines? And again, the actual business data is not even stored there; it's stored on additional attached storage accounts. So I am not able to understand how this translates to an on-prem Hadoop cluster. The core design of a Hadoop cluster revolves around the concept of data locality: data resides as close as possible to the processing. I know that when we create an HDInsight cluster, we create it in the same region as the storage accounts being attached. But it looks more like multiple processing units (VMs) all sharing common storage than individual nodes each with their own local storage. Probably, as long as the cluster can access the data fast enough within the data center (as though it resided locally), it should not matter. But I'm not sure that's the case. The cloud-based model presents the following picture to me:

[image: diagram not available]

Can someone explain exactly how the Apache Hadoop design translates into the Azure-based model? The confusion arises from the fact that the storage accounts are fixed, while we can kill or spin up a cluster at any time and point it to the same storage accounts.


Solution

  • When HDInsight performs its task, it streams data from the storage node to the compute node. But many of the map, sort, shuffle, and reduce tasks that Hadoop performs are done on the local disks of the compute nodes themselves.

    The map, reduce, and sort tasks are typically performed on the compute nodes with minimal network load, while the shuffle tasks use some network bandwidth to move data from the mapper nodes to the fewer reducer nodes.

    The final step of storing the data back to storage typically involves a much smaller dataset (e.g. a query dataset or report). In the end, the network is used most heavily during the initial and final streaming phases, while most of the other tasks are performed within the nodes (i.e. with minimal network utilization).

    For more detail, see "Why use Blob Storage with HDInsight on Azure" and "HDInsight Architecture".
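
To make the phase-by-phase argument above concrete, here is a rough back-of-the-envelope sketch in Python. It is not an HDInsight or Hadoop API: `JobProfile` and `network_traffic_gb` are hypothetical names, the sizes are made-up illustrative numbers, and the model assumes that map output is spread evenly across the workers, so only the share destined for other nodes crosses the network during the shuffle.

```python
from dataclasses import dataclass


@dataclass
class JobProfile:
    """Hypothetical sizes for one MapReduce job, in gigabytes."""
    input_gb: float       # data streamed in from the remote storage account
    map_output_gb: float  # intermediate data the mappers write to local disk
    output_gb: float      # final result written back to remote storage
    workers: int          # number of worker nodes


def network_traffic_gb(job: JobProfile) -> dict:
    """Estimate how much data crosses the network in each phase.

    Assumption: intermediate data is spread evenly over the workers, so a
    fraction (workers - 1) / workers of it moves to another node in the shuffle.
    """
    shuffle_fraction = (job.workers - 1) / job.workers if job.workers > 0 else 0.0
    return {
        "read_input": job.input_gb,      # storage -> compute nodes
        "map_and_sort": 0.0,             # spills stay on the worker's local disk
        "shuffle": job.map_output_gb * shuffle_fraction,
        "write_output": job.output_gb,   # compute nodes -> storage
    }


if __name__ == "__main__":
    # Illustrative numbers only: 100 GB in, 10 GB intermediate, 1 GB out, 4 workers.
    job = JobProfile(input_gb=100, map_output_gb=10, output_gb=1, workers=4)
    for phase, gigabytes in network_traffic_gb(job).items():
        print(f"{phase}: {gigabytes} GB over the network")
```

With a profile like this, the initial read accounts for most of the network traffic and the intermediate work stays largely on local disk, which is the pattern the answer describes. Real jobs can differ, for example when map output is not much smaller than the input.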