hbase, hdfs, cloudera, impala

POC: Cloudera Impala + HDFS + HBase on separate clusters


I'm working on a Big Data system architecture. I know Impala can execute queries on data stored in an HDFS/HBase cluster.

But what if I have one HDFS cluster plus a separate cluster where I keep the HBase data? Will Impala be able to execute queries that merge data from both clusters?


Solution

  • First, HBase stores its data in HDFS, so your HBase cluster must already be running HDFS as well.

    When Impala reads or writes HDFS data, it accesses the blocks directly at the OS level, which is why it is so fast at this. When Impala reads HBase data, it instead acts as an HBase client and goes through the HBase API, rather than reading the HBase files directly from disk.

    Thus HBase doesn't have to be installed on the same cluster as Impala. However, the two clusters must be able to reach each other over the network, since Impala connects to HBase like any other client.
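
To illustrate the point, here is a minimal Python sketch of what such a merged query could look like from the client side. It assumes the HBase table has already been mapped to a table that Impala can see, and that the other table is backed by files on the HDFS cluster. The table names `events_hdfs` and `users_hbase`, the join column `user_id`, and both helper functions are hypothetical and only serve the example.

```python
def build_join_query(hdfs_table, hbase_table, key):
    """Build a SQL join between an HDFS-backed and an HBase-backed table.

    All three arguments are hypothetical identifiers chosen by the caller.
    To Impala both are ordinary tables; only the storage behind them differs.
    """
    return (
        f"SELECT e.*, u.* "
        f"FROM {hdfs_table} e "
        f"JOIN {hbase_table} u ON e.{key} = u.{key}"
    )


def run_query(conn, sql):
    """Run `sql` on any DB-API 2.0 connection and return all rows."""
    cur = conn.cursor()
    try:
        cur.execute(sql)
        return cur.fetchall()
    finally:
        cur.close()


if __name__ == "__main__":
    # Only prints the statement; running it needs a live Impala connection.
    print(build_join_query("events_hdfs", "users_hbase", "user_id"))
```

With a real deployment, `conn` could come from a DB-API driver such as impyla (`impala.dbapi.connect(host=..., port=21050)`, where the host is whichever Impala daemon you connect to). The query text itself does not need to know that the two tables live on different clusters.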