Tags: hive, mapreduce, hdfs, hbase, batch-processing

How to process all HBase data with Hive


I have an HBase table with 750 GB of data. All of it is time-series sensor data, and my row key design is like this:

deviceID,sensorID,timestamp

I want to prepare all the data in HBase for batch processing (for example, CSV format on HDFS). But there is a lot of data in HBase. Can I prepare the data using Hive without fetching it piecemeal? If I fetch data by sensor ID (a scan query with start and end rows), I have to specify the start and end row every time, and I don't want to do that.


Solution

  • You can try using Hive-HBase integration and map the HBase table's data to a Hive table.

    Then, using that Hive-HBase table, you can create a full dump of the HBase table into a regular Hive table (ORC, Parquet, etc.).

    Step-1: Create the HBase-Hive integrated table:

    hive> CREATE EXTERNAL TABLE <db_name>.<hive_hbase_table_name> (key int, value string) 
          STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
          WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
          TBLPROPERTIES ("hbase.table.name" = "<hbase_table_name>");
    

    Step-2: Create a Hive dump of the HBase table:

    hive> CREATE TABLE <db_name>.<table_name> STORED AS ORC AS 
             SELECT * FROM <db_name>.<hive_hbase_table_name>;
    

    Step-3: Export to CSV format:

    hive> INSERT OVERWRITE DIRECTORY '<hdfs_directory>' 
          ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
          SELECT * FROM <db_name>.<hive_hbase_table_name>;
    

    Refer to this link for more details and options regarding exporting Hive tables.
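
    Applied to the row-key layout from the question, the three steps might look like the sketch below. The database name `sensors`, HBase table name `sensor_data`, column family `cf`, and qualifier `value` are all assumptions; adjust them to your actual schema. Hive's built-in `split()` function can break the composite `deviceID,sensorID,timestamp` key into separate columns during the dump:

    ```sql
    -- Step 1: map the existing HBase table into Hive
    -- (table/column-family names here are placeholders).
    CREATE EXTERNAL TABLE sensors.hbase_readings (rowkey string, reading string)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:value")
    TBLPROPERTIES ("hbase.table.name" = "sensor_data");

    -- Step 2: full dump to ORC, splitting the composite row key
    -- "deviceID,sensorID,timestamp" into separate columns.
    CREATE TABLE sensors.readings_orc STORED AS ORC AS
    SELECT split(rowkey, ',')[0] AS device_id,
           split(rowkey, ',')[1] AS sensor_id,
           split(rowkey, ',')[2] AS ts,
           reading
    FROM sensors.hbase_readings;

    -- Step 3: export the dump as CSV files on HDFS
    -- (the output directory is a placeholder path).
    INSERT OVERWRITE DIRECTORY '/tmp/sensor_csv'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    SELECT * FROM sensors.readings_orc;
    ```

    Because the CTAS in Step 2 scans the whole HBase table once as a single batch job, no per-sensor start/stop rows are needed, and the CSV export in Step 3 reads the fast ORC copy instead of going back to HBase.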