hadoop, hbase, parquet, apache-phoenix

Storing data in HBase vs Parquet files


I am new to big data and am trying to understand the various ways of persisting and retrieving data. I understand that both Parquet and HBase are column-oriented storage formats, but Parquet is a file-oriented storage format and, unlike HBase, not a database. My questions are:

  1. What is the use case for using Parquet instead of HBase?
  2. Is there a use case where Parquet can be used together with HBase?
  3. When performing joins, will Parquet perform better than HBase (say, accessed through a SQL skin like Phoenix)?

Solution

  • As you have already said in the question, Parquet is a storage format, while HBase is storage (HDFS) plus a query engine (API/shell). So a valid comparison should be made between Parquet + Impala/Hive/Spark and HBase. The key differences are:

    1) Disk space - Parquet takes less disk space than HBase. Parquet's encoding saves more space than block compression in HBase.

    2) Data ingestion - Ingesting data into Parquet is more efficient than into HBase. A simple reason could be point 1: with Parquet, less data needs to be written to disk.

    3) Record lookup by key - HBase is faster, as it is a key-value store while Parquet is not. Indexing in Parquet is to be supported in a future release.

    4) Filter and other scan queries - Since Parquet stores more information about the records in each row group, it can skip many records while scanning the data. This is why it is faster than HBase for such queries.

    5) Updating records - HBase supports record updates, while this may be problematic in Parquet because the Parquet files need to be rewritten. Careful schema design and partitioning may make updates cheaper, but they still will not be comparable with HBase.
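To make point 4 more concrete, here is a minimal sketch in plain Python of how per-row-group statistics let a scan skip data. It is a toy model, not the Parquet library: `RowGroup`, `build_row_groups` and `scan_greater_than` are hypothetical names, and it assumes the only statistics kept are the min and max of one column.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    """Hypothetical stand-in for a row group: rows plus min/max stats for one column."""
    rows: List[dict]
    min_value: int
    max_value: int


def build_row_groups(rows, column, group_size):
    """Split rows into fixed-size groups and record min/max of `column` per group."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        values = [r[column] for r in chunk]
        groups.append(RowGroup(chunk, min(values), max(values)))
    return groups


def scan_greater_than(groups, column, threshold):
    """Return rows where column > threshold, and how many groups were actually read."""
    matches, groups_read = [], 0
    for group in groups:
        if group.max_value <= threshold:
            continue  # the statistics prove no row in this group can match
        groups_read += 1
        matches.extend(r for r in group.rows if r[column] > threshold)
    return matches, groups_read


# Made-up sample data: 12 rows with amounts 10, 20, ..., 120.
rows = [{"id": i, "amount": i * 10} for i in range(1, 13)]
groups = build_row_groups(rows, "amount", group_size=4)
matches, groups_read = scan_greater_than(groups, "amount", 90)
print(len(matches), "matching rows,", groups_read, "of", len(groups), "groups read")
```

With this sample data only the last group has a maximum above the threshold, so the other two are skipped without looking at their rows.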

    Comparing the above features, HBase seems more suitable for situations where updates are required and queries mainly involve key-value lookups. Queries involving a key range scan will also perform better in HBase.
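As a rough illustration of why a store that keeps rows sorted by key handles lookups, updates and key range scans well, here is a small Python sketch. `SortedKeyStore` is a hypothetical toy class, not the HBase client API, and the `user#...` keys are made up. The scan treats the start key as inclusive and the stop key as exclusive, which is an assumption of this sketch.

```python
import bisect


class SortedKeyStore:
    """Toy key-value store that keeps row keys sorted (hypothetical, not the HBase API)."""

    def __init__(self):
        self._keys = []    # row keys, always kept in sorted order
        self._values = {}  # row key -> latest value

    def put(self, key, value):
        """Insert a new row, or overwrite the value of an existing key in place."""
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        """Point lookup by row key."""
        return self._values.get(key)

    def scan(self, start_key, stop_key):
        """Return rows with start_key <= key < stop_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


store = SortedKeyStore()
for user in ("user#003", "user#001", "user#010", "user#002"):
    store.put(user, {"name": user})
store.put("user#002", {"name": "updated"})  # update touches only this one row

print(store.get("user#002"))
print(store.scan("user#001", "user#003"))
```

The range scan uses binary search to find its boundaries and then reads only the rows in between, instead of filtering every row.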

    Parquet is suitable for use cases where updates are rare and queries involve filters, joins and aggregations.
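To see why updates are costly for an immutable file format, and why partitioning helps only up to a point, consider this hedged Python sketch. It models each partition as a list of rows standing in for one file. `rewrite_partition` and the month-style partition keys are hypothetical, and no real Parquet I/O takes place.

```python
def rewrite_partition(partitions, partition_key, record_id, new_fields):
    """Change one record by rewriting the whole partition that holds it.

    Returns the number of rows that had to be rewritten.
    """
    old_rows = partitions[partition_key]
    new_rows = [
        dict(row, **new_fields) if row["id"] == record_id else row
        for row in old_rows
    ]
    partitions[partition_key] = new_rows  # models replacing the old file with a new one
    return len(new_rows)


# Made-up sample data: two partitions holding five rows in total.
partitions = {
    "2024-01": [{"id": 1, "status": "new"}, {"id": 2, "status": "new"}],
    "2024-02": [
        {"id": 3, "status": "new"},
        {"id": 4, "status": "new"},
        {"id": 5, "status": "new"},
    ],
}

rows_rewritten = rewrite_partition(partitions, "2024-02", 4, {"status": "paid"})
total_rows = sum(len(rows) for rows in partitions.values())
print(rows_rewritten, "rows rewritten out of", total_rows)
```

Partitioning limits the rewrite to one partition, but changing a single record still rewrites every row in that partition, whereas a key-value store changes only the one row.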