Search code examples
joinhivehbaseapache-kudu

Hive Hbase JOIN performance & KUDU


Reading the Cloudera documentation using Impala to join a Hive table against HBase smaller tables as stated below, then in the absence of a Big Data appliance such as OBDA and a largish HBase dimension table that is mutable:

If you have join queries that do aggregation operations on large fact tables and join the results against small dimension tables, consider using Impala for the fact tables and HBase for the dimension tables. (Because Impala does a full scan on the HBase table in this case, rather than doing single-row HBase lookups based on the join column, only use this technique where the HBase table is small enough that doing a full table scan does not cause a performance bottleneck for the query.)

Is there any way to get that single key look up in another way?

In addition I noted the following on KUDU and HDFS, presumably HIVE. Does anybody have experience here? Keen to know. I will be tryiong it myself in due course, but installing parcels on non-parcelled quickstarts is not so easy...

Mix and match storage managers within a single application (or query)

• SELECT COUNT(*) FROM my_fact_table_on_hdfs JOIN
my_dim_table_in_kudu ON ...

Solution

  • Erring on the side of caution, linking with KUDU for dimensions would be the way to go so as to avoid a scan on a large dimension in HBASE when a lkp is only required.

    I am retracting the latter point, I am sure that a JOIN will not cause an HBASE scan if it is an equijoin.

    That said, IMPALA with MPP allows an MPP approach w/o MR and JOINing of dimensions with fact tables. The advantage of the OBDA is less obvious now. imo.