Search code examples
hadoophivedata-warehouse

Hive - Is it a good fit for building a datawarehouse?


So like most Enterprise companies, we have built a data warehouse in Hadoop, with user queries supported in Hive, and now after a few months and user acceptance testing everyone is a little surprised about how it is not like a standard (Oracle/Netezza) database when used by end-users for ad-hoc data analysis. While I understand that this is probably a very stupid way of doing projects (we should have researched the use cases and best fit technologies before building the product), and I know the basic technical aspects of how Hadoop differs from single node machines... I would still want to understand if using Hadoop/Hive makes sense for data warehouses in any scenario? For instance,

  • Are there always trade-offs in query performance or can they be optimized with configuration changes, horizontal scaling of hardware?
  • Can it ever be as fast as something like Netezza - which uses non-commodity hardware but functions on a similar architecture?
  • Where is Hadoop great and absolutely defeats everything else in comparison?

Solution

  • I would argue the Hive MetaStore is useful more than HiveServer2 itself as the query interface.

    The MetaStore is what Presto and Spark use to get data much quicker than MapReduce, but maybe not as fast as a well-optimized Tez query, and improvements are being made in Hive v2.x+ with LLAP, for example.

    In the end, Hive is really only useful if the ingestion pipelines are actually storing the data in columnar formats of ORC or Parquet to begin with. From there, and reasonable query engine can scan through that data fairly quickly, and Hive just happens to be considered the defacto implementation of that access pattern, whereas Impala or Presto are often more used for adhoc access.

    That being said, Hive (and other SQL on Hadoop) is not used for "building", it is used for "analyzing"

    And I don't know what you mean by "standard" - Hive supports any ODBC/JDBC Connection, so it's not like you go to the CLI for all access, and HUE or Zeppelin make really nice notebooks for SQL analysis over Hive.