Tags: hbase, aggregate, parquet, nosql-aggregation, column-aggregation

Why is HBase full scan and aggregation slower than Parquet, despite also being a columnar database?


I've been trying to use the "right" technology for a 360-degree customer application. It requires:

  1. A wide-column table: each customer is one row, with lots of columns (say, > 1000). A sketch of such a table follows this list.
  2. We have ~20 batch analytics jobs running daily. Each job queries and updates a small set of columns, for all the rows. This includes aggregating the data for reporting, and loading/saving the data for machine learning algorithms.
  3. We update customers' info in several columns, for <= 1 million rows per day; the update workload is spread out across working hours. We have more than 200 million rows.
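
For concreteness, here is a minimal sketch of such a table using the HBase 2.x Java client (the table and column-family names are made up):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreateCustomerTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // One row per customer; the > 1000 columns are qualifiers
            // spread across a few column families.
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("customer_360"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("profile"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("analytics"))
                    .build());
        }
    }
}
```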

I have tried HBase, and points 1 and 3 are met. But I found that doing analytics (load/save/aggregate) on HBase is painfully slow: it can be 10x slower than with Parquet. I don't understand why, since both Parquet and HBase are columnar DBs, and we have spread the workload across the HBase cluster quite well (the "requests per region" metric says so).

Any advice? Am I using the wrong tool for the job?


Solution

  • both Parquet and HBase are columnar DBs

    This assumption is wrong:

    • Parquet is not a database.
    • HBase is not a columnar database. It is frequently described as one, but that is wrong: HFile, HBase's on-disk format, is not column-oriented (Parquet's is). HBase groups data by column family in row-ordered files.
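
    The practical consequence, sketched below with made-up family/qualifier names: restricting a Scan to a couple of columns limits what is returned to the client, but HBase still reads through the row-ordered HFile blocks of that column family to find the cells, whereas Parquet can read each column's pages in isolation and skip everything else.

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class NarrowScan {
    // "profile", "age", and "city" are hypothetical names.
    static Scan narrowScan() {
        return new Scan()
            .addColumn(Bytes.toBytes("profile"), Bytes.toBytes("age"))
            .addColumn(Bytes.toBytes("profile"), Bytes.toBytes("city"))
            .setCaching(1000); // fetch more rows per RPC on wide scans
    }
}
```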

  • doing analytics (load/save/aggregate) on HBase is painfully slow: it can be 10x slower than with Parquet

    An HBase full scan is generally much slower than the equivalent raw HDFS file scan, because HBase is optimized for random-access patterns. You didn't specify how exactly you scanned the table; TableSnapshotInputFormat is much faster than the naive TableInputFormat, yet still slower than a raw HDFS file scan.
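
    If the scan has to stay in MapReduce, a snapshot-based job reads the snapshot's HFiles directly from HDFS instead of going through the RegionServers. A minimal sketch, assuming a snapshot named "customer_360_snap" has already been taken (the snapshot name, scratch path, and trivial row-counting mapper are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SnapshotScanJob {
    // Map-only job: count rows via a counter while scanning the snapshot's HFiles.
    static class CountMapper extends TableMapper<NullWritable, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable key, Result row, Context ctx) {
            ctx.getCounter("scan", "rows").increment(1);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "snapshot-full-scan");
        job.setJarByClass(SnapshotScanJob.class);

        Scan scan = new Scan(); // narrow with addColumn()/addFamily() as needed
        TableMapReduceUtil.initTableSnapshotMapperJob(
            "customer_360_snap",            // read the snapshot, not the live table
            scan, CountMapper.class,
            NullWritable.class, NullWritable.class, job,
            true,                           // ship HBase dependency jars
            new Path("/tmp/snapshot-restore")); // scratch dir for restored refs
        job.setOutputFormatClass(NullOutputFormat.class);
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```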