HBase doesn't allow join operation on its tables. To overcome this, I am planning to create HBase table and access it through Impala.
Impala allows all joins along with group by and other SQL operation. I have few question related to it -
Has anyone tested this approach?
Will all SQL operation available work equally well as impala works with Hive?
I tried finding answer in cloudera's documentation but there is not clear answer.
The word "Clear answer" depends on what parameters you are looking for...
Q1 : Has anyone tested this approach?
Note (additional information in the context of External tables) :
Impala does not replace Hive, it is good for very different use cases. Impala doesn't provide fault-tolerance compared to Hive, so if there is a problem during your query then it's gone. user has to reissue the query. For ETL jobs where FT has paramount importance hive is good fit.
Impala is faster than Apache Hive but that does not mean that it is the one stop SQL solution for all big data problems. Impala is memory intensive and does not run effectively for heavy data operations like joins because it is not possible to push in everything into the memory. This is when Hive comes to the rescue. If an application has batch processing kind of needs over big data then organizations must opt for Hive. If they need real time processing of ad-hoc queries on subset of data then Impala is a better choice.
Q2 : Will all SQL operation available work equally well as impala works with Hive?
Either Impala's external tables of Hbase or Internal tables of Impala SQL doesn't change. Impala (SQL syntax follows the SQL-92 standard) and Hive share the same metastore database and their tables are often used interchangeably.
Please see below diagram for further information from this article since plain link I shouldn't use in SO, I'm using the diagram from the article published for better understanding.