
Spark with HBase vs Spark with HDFS


I know that HBase is a columnar database that stores the structured data of tables in HDFS by column instead of by row. I know that Spark can read from and write to HDFS, and that there is an HBase connector for Spark that can also read and write HBase tables.

Questions:

1) What capabilities are added by layering Spark on top of HBase instead of using HBase alone? Does it depend only on the programmer's skills, or is there a performance reason to do it? Are there things Spark can do that HBase alone can't?

2) Following from the previous question, when should you add HBase between HDFS and Spark instead of using HDFS directly?


Solution

  • 1) What capabilities are added by layering Spark on top of HBase instead of using HBase alone? Does it depend only on the programmer's skills, or is there a performance reason to do it? Are there things Spark can do that HBase alone can't?

    At Splice Machine, we use Spark for our analytics on top of HBase. HBase does not have an execution engine, and Spark provides a competent one on top of HBase (intermediate results, relational algebra, etc.). HBase is an MVCC storage structure and Spark is an execution engine, so they are natural complements to one another.
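To make the storage/execution split concrete, here is a minimal toy sketch in plain Python. It uses neither the HBase nor the Spark API: `ToyKeyValueStore` and `total_by` are hypothetical names invented for illustration. The store exposes only put, get, and scan, roughly the kind of operations a key-value table offers, while the grouping and summing (the relational algebra part) happens in a separate layer that consumes the scan.

```python
from collections import defaultdict


class ToyKeyValueStore:
    """Hypothetical stand-in for a keyed table: only put, get, and scan."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, columns):
        # Merge the given columns into the row, creating it if needed.
        self._rows.setdefault(row_key, {}).update(columns)

    def get(self, row_key):
        # Point lookup by row key; returns None if the row is absent.
        return self._rows.get(row_key)

    def scan(self, start=None, stop=None):
        # Yield rows in sorted key order within [start, stop).
        for key in sorted(self._rows):
            if (start is None or key >= start) and (stop is None or key < stop):
                yield key, self._rows[key]


def total_by(rows, group_col, value_col):
    """Engine-side GROUP BY / SUM over scanned rows (the part the store lacks)."""
    totals = defaultdict(int)
    for _key, cols in rows:
        totals[cols[group_col]] += cols[value_col]
    return dict(totals)


orders = ToyKeyValueStore()
orders.put("order#001", {"customer": "alice", "amount": 30})
orders.put("order#002", {"customer": "bob", "amount": 15})
orders.put("order#003", {"customer": "alice", "amount": 20})

# The store hands back raw rows; the aggregation runs outside it.
totals = total_by(orders.scan(), "customer", "amount")
print(totals)  # {'alice': 50, 'bob': 15}
```

In a real deployment the aggregation layer would be Spark, which distributes this work and keeps intermediate results across a cluster; the sketch only shows where the dividing line sits.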

    2) Following from the previous question, when should you add HBase between HDFS and Spark instead of using HDFS directly?

    When the workload involves small reads, concurrent read/write patterns, or incremental updates (which covers most ETL).

    Good luck...
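The small-read and incremental-update cases from the second answer can be illustrated with a rough toy model in plain Python. It is not HDFS or HBase code, and every function name here is hypothetical. The assumption is that a file on HDFS cannot be modified in place, so changing one record means writing a new copy, and finding one record means reading through the file, whereas a keyed store addresses a row directly. The model ignores HBase internals such as the write-ahead log and compactions.

```python
def update_in_file(records, row_key, new_value):
    """Model a write-once file: changing one record rewrites every record."""
    rewritten = [(k, new_value if k == row_key else v) for k, v in records]
    return rewritten, len(rewritten)  # new contents, records written


def read_from_file(records, row_key):
    """Model a small read from a plain file: scan until the key is found."""
    examined = 0
    for key, value in records:
        examined += 1
        if key == row_key:
            return value, examined
    return None, examined


def update_in_keyed_store(table, row_key, new_value):
    """Model a keyed store: the row is addressed directly by its key."""
    table[row_key] = new_value
    return 1  # records written


# 1000 records, first held as a flat file, then as a keyed table.
file_records = [(f"user#{i:04d}", 0) for i in range(1000)]
keyed_table = dict(file_records)

file_records, file_writes = update_in_file(file_records, "user#0042", 7)
store_writes = update_in_keyed_store(keyed_table, "user#0042", 7)
value, examined = read_from_file(file_records, "user#0042")

print(file_writes, store_writes)  # 1000 1
print(value, examined)            # 7 43
```

For large sequential scans the file layout is perfectly good, which is why reading HDFS directly from Spark remains the usual choice for bulk analytics; the gap only appears when individual rows are read or changed.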