Search code examples
hadoopopenstackdata-analysis

Hadoop on top of OpenStack, what extra features do I get?


I want to deploy a small data center for data analysis purposes. I will get the data mostly from web applications. I know I can setup a hadoop cluster and scale it as per necessity. I also know that OpenStack is a free and open-source software platform for cloud-computing, mostly deployed as an infrastructure-as-a-service (IaaS). However, it is apparent that some industries are preferring hadoop on top of OpenStack (Sahara). Thus, I want to know the difference, advantage and disadvantage of Hadoop with or without OpenStack.

In brief, if I put Hadoop on top of OpenStack, what extra features do I get?


Solution

  • Hadoop is meant to be distributed storage and computing. To facilitate that, it is originally developed keeping commodity hardware with local storage in mind. Data will be stored locally on each of the server and data on a particular server will be processed locally. The concept is called "data locality".

    Cloud platforms typically use network storage, which beats the purpose of data locality. With network storage data is not stored locally any more. However technologies like "Amazon EMR" are still extensively used for pay as you go model. Persistent data will be stored in cloud such as s3, but data can be copied to local file systems to process on "instance stores". It is kind of hybrid approach. In EMR, if we have to bring up cluster for 2 days every month, when the cluster is started data is copied from s3 to HDFS (local storage), process it and before terminating the cluster we make sure to copy the data to be persisted back to s3.

    Coming back to OpenStack, it can be used for Proof of concepts, but true production clusters might have significant performance issues.