hadoop · hbase · hdfs · bigdata

HBase table size decreases after a period of time


We have a problem with storing data in HBase. We've taken the following steps:

  1. A big CSV file (size: 20 GB) is processed by a Spark application, which produces HFiles as its result (result data size: 180 GB). The generated files can be inspected as shown just after this list.
  2. The table is created in the HBase shell with: create 'TABLE_NAME', {'NAME' => 'cf', 'COMPRESSION' => 'SNAPPY'}
  3. Data from the created HFiles is bulk-loaded with: hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles -Dhbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily=1024 hdfs://ip:8020/path TABLE_NAME

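To check whether the generated HFiles are actually Snappy-compressed before loading, their metadata can be dumped with HBase's HFile tool. A minimal sketch; the file name under hdfs://ip:8020/path/cf/ is a placeholder, and the exact fields printed vary between HBase versions:

hbase hfile -m -f hdfs://ip:8020/path/cf/<generated-hfile>   # <generated-hfile> is a placeholder name

The meta output should include the compression codec used for that particular file.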
Right after loading, the table size is 180 GB; however, after some period of time (yesterday it was at 8 pm, two days ago around 8 am) a process is launched which compacts the data down to 14 GB.
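The on-disk size of the table can be tracked directly in HDFS while this happens. A minimal sketch, assuming the default HBase layout (the actual path depends on hbase.rootdir, the namespace, and the HBase version):

hdfs dfs -du -h /hbase/data/default/TABLE_NAME   # default layout; adjust to your hbase.rootdir

Running this right after the bulk load and again after the mystery process runs should show the 180 GB and 14 GB figures directly.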

My question is: what is the name of this process? Is it a major compaction? I ask because I'm trying to trigger compaction (major_compact and compact) manually, but this is the output of the command run on the uncompacted table:

hbase(main):001:0> major_compact 'TEST_TYMEK_CRM_ACTION_HISTORY'
0 row(s) in 1.5120 seconds
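Note that major_compact in the HBase shell only queues the compaction request and returns immediately, so the quick "0 row(s)" output does not by itself mean that nothing happened. On reasonably recent HBase versions the state of the request can also be checked from the shell, for example:

hbase(main):002:0> compaction_state 'TEST_TYMEK_CRM_ACTION_HISTORY'

This should report MAJOR (or MAJOR_AND_MINOR) while the compaction is running and NONE once it has completed.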

Solution

  • This is the compaction process. I can suggest the following reason for such a big difference in table size. When the HFiles are produced by the Spark application, no compression codec is applied to them, because the table's COMPRESSION setting only takes effect for files that HBase itself writes after the table is created. Attaching the HFiles to the table via bulk load doesn't change their format (all files in HDFS are immutable). Only after the compaction process rewrites them is the data compressed. You can monitor the compaction process via the HBase Master UI, which usually runs on port 60010 (16010 on HBase 1.0 and later). A sketch of applying the compression at HFile-generation time follows below.
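If the extra 180 GB footprint before the first compaction is a problem, the HFile-generation step can be configured to pick up the table's compression before the files are written. The sketch below is an assumption about how such a job could be wired, shown as a plain Hadoop driver rather than the original Spark application (Spark bulk-load writers typically configure the same Hadoop Job object); the class name, mapper wiring, and output path are placeholders, and the CSV-parsing stage is elided:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CsvToHFilesJob {  // hypothetical driver class
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName tableName = TableName.valueOf("TABLE_NAME");

    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(tableName);
         RegionLocator locator = conn.getRegionLocator(tableName)) {

      Job job = Job.getInstance(conf, "csv-to-hfiles");
      job.setJarByClass(CsvToHFilesJob.class);
      // ... configure the input format and a mapper that emits
      //     (ImmutableBytesWritable rowKey, KeyValue cell) pairs for the CSV rows ...

      // Copies the table's HFile settings (COMPRESSION => 'SNAPPY', bloom filter,
      // block encoding) into the job and sets up the partitioner/reducer so that
      // one set of HFiles is produced per region.
      HFileOutputFormat2.configureIncrementalLoad(job, table, locator);

      FileOutputFormat.setOutputPath(job, new Path("hdfs://ip:8020/path"));  // placeholder output path
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }
}

configureIncrementalLoad copies the column family's compression (plus bloom filter and block encoding) settings from the table into the job, so the HFiles come out Snappy-compressed and the size right after bulk load should already be close to the post-compaction size.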