
Enterprise data warehouse with NoSQL/Hadoop - "no RDBMS"


Are there any EDW (enterprise data warehouse) systems designed using a NoSQL/Hadoop solution?

I know there are PDW systems (MS PDW with PolyBase, Greenplum HAWQ, etc.) which connect to HDFS sub-systems. These are proprietary hardware and software solutions and are expensive at scale. I am looking for an enterprise data warehouse solution built on NoSQL or Hadoop, preferably open source. I would like to hear about your experiences if you have implemented one. To repeat, I am not looking for any type of proprietary RDBMS as a player in this EDW solution.

I did some research on the internet. It seems possible (Impala is one option), but I did not find anyone who has actually implemented an EDW entirely with NoSQL or Hadoop.

If you have done something of this type, I would like to hear how you designed it and which tools your business analysts use. If you can share your experience along the journey, that would be really appreciated.

Update: How about VoltDB and NEOdb? They are not true RDBMSs, but they claim to support ANSI SQL to a great extent.


Solution

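  • The first problem you will face when building an EDW on top of Hadoop is that its storage is not updatable in place: HDFS files are written once, so you should forget about SQL UPDATE and DELETE commands. Changes are typically applied by rewriting whole files or partitions.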
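A minimal sketch of what "no UPDATE or DELETE" tends to mean in practice: instead of modifying rows in place, a job reads the existing partition, merges a batch of changes, and writes a brand-new partition that replaces the old one. The function name `rewrite_partition` and the change format (a list of `(key, row_or_None)` pairs, where `None` marks a delete) are assumptions made for illustration and are not part of any Hadoop API.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, object]
Change = Tuple[object, Optional[Row]]  # (key, new row) or (key, None) for a delete


def rewrite_partition(base_rows: List[Row], changes: List[Change], key: str = "id") -> List[Row]:
    """Build the contents of a replacement partition.

    The old rows are never modified; the caller writes the returned rows
    to a new file and swaps it in for the old one.
    """
    # If the same key appears more than once, the last change wins.
    pending = dict(changes)

    merged: List[Row] = []
    for row in base_rows:
        k = row[key]
        if k in pending:
            replacement = pending.pop(k)
            if replacement is not None:   # "update": keep the new version
                merged.append(replacement)
            # replacement is None -> "delete": drop the row
        else:
            merged.append(row)            # untouched row is copied as-is

    # Whatever is left did not match an existing row, so it is an insert.
    merged.extend(r for r in pending.values() if r is not None)
    return merged


base = [
    {"id": 1, "city": "Oslo"},
    {"id": 2, "city": "Rome"},
    {"id": 3, "city": "Lima"},
]
changes = [
    (2, {"id": 2, "city": "Milan"}),  # update
    (3, None),                        # delete
    (4, {"id": 4, "city": "Kyiv"}),   # insert
]
new_partition = rewrite_partition(base, changes)
```

The cost is the point here: changing a single row means rewriting every row that shares its file or partition, which is why update-heavy workloads are a poor fit for this kind of storage.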

    Second, a solution built on top of Hadoop is usually several times more expensive to maintain: the specialists cost more and debugging is more complex (compare debugging a problem in a Hive query with debugging a SQL query in Oracle, and ask which would be easier).

    Third, Hadoop usually gives you much lower concurrency and much higher latency for any type of workload you put on top of it.

    Given all of this, why do you think a DWH is built on top of Hadoop only at really big enterprises like Facebook, Yahoo, eBay, LinkedIn and so on? Because it is not that simple to do, although once implemented it can be more scalable and more customizable than any proprietary solution.

    So if you have firmly decided to go with Hadoop or another NoSQL solution to build your DWH, I would recommend the following:

    1. Use Hadoop HDFS as the base for data storage
    2. Use Flume to load data into HDFS
    3. Use Hive with Tez for heavy ETL jobs
    4. Provide Impala as a SQL query interface for analysts
    5. Provide Spark as an advanced instrument for analysts
    6. Use Ambari for management and provisioning of all the tools together

    These tools together will cover most of your needs.
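One layout decision that ties several of these tools together is how files are organized in HDFS. A common convention, shown here as a rough sketch, is Hive-style `key=value` partition directories by date, so that the loader writes each day into its own directory and the query engines read only the directories a query needs. The table root `/warehouse/events` and the partition column `dt` are hypothetical names chosen for the example.

```python
from datetime import date, timedelta
from typing import List


def partition_path(table_root: str, day: date) -> str:
    """Directory for one day's data, using the Hive-style key=value convention."""
    return f"{table_root}/dt={day.isoformat()}"


def partitions_for_range(table_root: str, start: date, end: date) -> List[str]:
    """Directories a query over [start, end] would have to read (both ends inclusive)."""
    days = (end - start).days
    return [partition_path(table_root, start + timedelta(days=i)) for i in range(days + 1)]


paths = partitions_for_range("/warehouse/events", date(2015, 2, 27), date(2015, 3, 1))
for p in paths:
    print(p)
```

With a layout like this, a query filtered on a three-day window touches three directories rather than the whole table, which helps offset some of the latency mentioned above.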