Search code examples
hadoophivehbasehiveqlimpala

Will all SQL operation work while accessing Hbase tables using Impala?


HBase doesn't allow join operation on its tables. To overcome this, I am planning to create HBase table and access it through Impala.

Impala allows all joins along with group by and other SQL operation. I have few question related to it -

  1. Has anyone tested this approach?

  2. Will all SQL operation available work equally well as impala works with Hive?

I tried finding answer in cloudera's documentation but there is not clear answer.


Solution

  • The word "Clear answer" depends on what parameters you are looking for...

    Q1 : Has anyone tested this approach?

    • Yes this approach of Impala - hbase external tables is possible, as we have done it the same for adhoc queries. However, our system has not yet been product-ionized. One thing is We are using Impala(Hbase external tables) to query (with joins and group by as you mentioned) raw data (i.e. adhoc queries, before doing any processing...) after our data-ingestion step through spark steaming . Since we are using cloudera, our option is obvious i.e impala for faster response. Earlier we used to do same with Hive (Hbase External tables).

    Note (additional information in the context of External tables) :

    • Impala does not replace Hive, it is good for very different use cases. Impala doesn't provide fault-tolerance compared to Hive, so if there is a problem during your query then it's gone. user has to reissue the query. For ETL jobs where FT has paramount importance hive is good fit.

    • Impala is faster than Apache Hive but that does not mean that it is the one stop SQL solution for all big data problems. Impala is memory intensive and does not run effectively for heavy data operations like joins because it is not possible to push in everything into the memory. This is when Hive comes to the rescue. If an application has batch processing kind of needs over big data then organizations must opt for Hive. If they need real time processing of ad-hoc queries on subset of data then Impala is a better choice.

    Q2 : Will all SQL operation available work equally well as impala works with Hive?

    Please see below diagram for further information from this article since plain link I shouldn't use in SO, I'm using the diagram from the article published for better understanding. additional information about impala & hive for decision support