I've been trying to pick the "right" technology for a 360-degree customer application. It requires:
I have tried HBase, and points 1 and 3 are met. But I found that doing analytics (load/save/aggregate) on HBase is painfully slow; it can be 10x slower than doing it with Parquet. I don't understand why: both Parquet and HBase are columnar DBs, and we have spread the workload across the HBase cluster quite well (the "requests per region" metric says so).
Any advice? Am I using the wrong tool for the job?
> both Parquet and HBase are columnar DBs

This assumption is wrong: HBase's on-disk format, the `HFile`, is not column-oriented (Parquet is). An HFile stores sorted key/value cells, so scanning a single column still reads past the cells of every other column in the same column family.

> HBase is painfully slow, it can be 10x slower than doing with Parquet
An HBase full scan is generally much slower than the equivalent raw HDFS file scan, because HBase is optimized for random-access patterns. You didn't specify how exactly you scanned the table: `TableSnapshotInputFormat`, which reads a snapshot's HFiles directly from HDFS and bypasses the RegionServers, is much faster than the naive `TableInputFormat`, yet still slower than a raw HDFS file scan.
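The layout difference is easy to see with a toy model. The sketch below (plain Python; the column names are invented, and it models only the layouts rather than touching HBase or Parquet) counts how many cells an aggregation over a single column has to read under each layout:

```python
# Toy model of the two on-disk layouts; column names are illustrative.
ROWS = 100_000
COLUMNS = ["id", "name", "email", "last_order", "lifetime_value"]

# Row-oriented (HFile-like): cells of all columns are interleaved on
# disk, so summing one column still reads every cell of every row.
row_oriented_cells_read = ROWS * len(COLUMNS)

# Columnar (Parquet-like): one column's cells are stored contiguously,
# so the same aggregation reads only that column's cells.
columnar_cells_read = ROWS

print(row_oriented_cells_read // columnar_cells_read)  # → 5
```

With wide 360-degree customer rows the ratio grows with the number of columns per family, which is consistent with the order-of-magnitude gap you are seeing.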