azure, azure-storage, hadoop-streaming, azure-hdinsight, cortana-intelligence

HDInsight - Azure blob storage


I have some basic questions about Azure HDInsight. The following article gives an introduction to using HDInsight: https://azure.microsoft.com/en-in/documentation/articles/hdinsight-hadoop-emulator-get-started/.

It says that HDInsight internally uses Azure Blob storage. With this in mind, my question is as follows:

I have an HDInsight cluster, hd1, which uses the storage account stg1. If I just want to upload and download files to stg1 using Azure Storage Explorer, then what is the use of having hd1? I can do that without creating an HDInsight cluster at all, which is expensive. So, is HDInsight only used for processing data stored in stg1 to produce results, such as a word count? Is that the only reason to use HDInsight?


Solution

  • To better understand how HDInsight works with Blob storage, read https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-use-blob-storage/.

    HDInsight is Microsoft's implementation of Hadoop. So far there are four base cluster types: Hadoop, HBase, Storm, and Spark. You can always install additional components on top of a base type.

    Your question is really about why you would use Hadoop. Hadoop shines when you need to process a lot of data, that is, big data.

    One of the differences between HDInsight and other Hadoop implementations is the separation of storage (Blob storage) from compute (HDInsight clusters). You still need to copy your data to Azure Blob storage (or store it there directly). When you are ready to process it, you create an HDInsight cluster, submit a job, and then delete the cluster. You delete the cluster so that you no longer pay for it. Even after the cluster is deleted, your data stored in Blob storage is retained.
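
To make the create, submit, delete cycle concrete, here is a minimal in-memory sketch in Python. It does not call Azure or the HDInsight SDK. `BlobStore` and `Cluster` are hypothetical stand-ins for a storage account like stg1 and a cluster like hd1, and the word count job is only an illustration of "processing". The point it tries to show is that the job output lives in storage, so it survives deleting the compute.

```python
from collections import Counter


class BlobStore:
    """Hypothetical in-memory stand-in for a storage account such as stg1."""

    def __init__(self):
        self.blobs = {}

    def upload(self, name, text):
        self.blobs[name] = text

    def download(self, name):
        return self.blobs[name]


class Cluster:
    """Hypothetical stand-in for a billable compute cluster such as hd1."""

    def __init__(self, store):
        # The cluster only holds a reference to the storage; it owns no data.
        self.store = store
        self.running = True

    def run_word_count(self, input_name, output_name):
        if not self.running:
            raise RuntimeError("cluster has been deleted")
        counts = Counter(self.store.download(input_name).split())
        lines = [f"{word}\t{n}" for word, n in sorted(counts.items())]
        # Results are written back to storage, not kept on the cluster.
        self.store.upload(output_name, "\n".join(lines))

    def delete(self):
        # Deleting compute leaves the storage untouched.
        self.running = False


# 1. Upload data; no cluster is needed for this step.
store = BlobStore()
store.upload("input.txt", "to be or not to be")

# 2. Create a cluster only when there is something to process.
cluster = Cluster(store)
cluster.run_word_count("input.txt", "wordcount.tsv")

# 3. Delete the cluster; input and output are both still in storage.
cluster.delete()
result = store.download("wordcount.tsv")
print(result)
```

In this sketch, uploading and downloading never touch `Cluster`, which mirrors the situation in the question: if all you need is file storage, the storage account alone is enough, and the cluster is only worth paying for while a job is running.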