azure, hive, azure-hdinsight, ambari

HDInsight and Hive queries


We are doing a POC for HDInsight, and I’m very new to this technology. We are trying to send some data to Azure and write a few Hive queries against it. The first part works: we pushed some test data to Azure Blob storage using AzCopy. (I understand there are also Azure Tables and Azure Queues, but for the POC, Azure Blob storage is just fine.)

We can use Visual Studio to talk to this blob. However, we also want to try out HDInsight and its MapReduce functionality.

With this background, here are a couple of questions:

 1. Do I need to copy data from Azure Blob storage to anywhere else
    before writing Hive queries in Ambari? Or can Ambari talk directly
    to data stored in Azure Blob storage?
 2. Is this the right way to process data? (Keep the data in
    Azure Blob storage, and use HDInsight/Ambari to process it.)
 3. If point 2 is correct, that means HDInsight is used only for
    parallel processing with its MapReduce feature. Is this correct?

Thank you so much for any insight.


Solution

 1. Yes, HDInsight can read data stored in Azure Blob storage directly,
    so there is no need to copy it elsewhere. Examples:

    https://learn.microsoft.com/en-us/azure/hdinsight/hadoop/apache-hadoop-linux-tutorial-get-started
    https://blogs.msdn.microsoft.com/azuredatalake/2017/04/06/azure-hdinsight-3-6-five-things-that-will-make-data-developer-happy/

 2. Yes. Depending on what you want to do, you can use Spark, MapReduce,
    Pig, or Hive to process the data. A good starting point is here:
    https://www.edx.org/course/processing-big-data-with-hadoop-in-azure-hdinsight

 3. Yes, the data is processed using one of the distributed frameworks,
    such as Spark, MapReduce, Hive, or Pig.
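
To make the first point concrete, here is a minimal sketch of how a Hive external table can point straight at files in Blob storage. HDInsight addresses Blob storage with URIs of the form `wasbs://<container>@<account>.blob.core.windows.net/<path>`. Everything else below is an assumption for illustration: the container `poc-data`, the account `mystorageaccount`, the table `test_data`, its columns, and the comma-delimited text format are all hypothetical and should be replaced with your own. The Python helper only builds the HiveQL text, which you would then paste into the Hive view in Ambari.

```python
from typing import Dict


def wasbs_uri(container: str, account: str, path: str = "") -> str:
    """Build a wasbs:// URI for a path inside an Azure Blob container."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def external_table_ddl(table: str, columns: Dict[str, str], location: str) -> str:
    """Return HiveQL defining an external table over comma-delimited text files.

    The file format is an assumption; adjust the ROW FORMAT / STORED AS
    clauses to match the data that was actually uploaded.
    """
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}';"
    )


if __name__ == "__main__":
    # Hypothetical container, account, and folder names.
    location = wasbs_uri("poc-data", "mystorageaccount", "/uploads/test/")
    print(external_table_ddl("test_data", {"id": "INT", "name": "STRING"}, location))
    # Once the table exists, it is queried like any other Hive table.
    print("SELECT name, COUNT(*) AS n FROM test_data GROUP BY name;")
```

Because the table is declared `EXTERNAL`, dropping it removes only the Hive metadata; the files uploaded with AzCopy stay in the container.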