Tags: hive, mapreduce, hbase, hadoop2, spark-avro

Non-HBase solution for storing huge data and updating it in real time


Hi, I have developed an application where I have to store several TB of data in an initial load, and then apply about 20 GB of monthly incremental changes (insert/update/delete), delivered as XML, on top of this 5 TB of data. Finally, on request, I have to generate a full snapshot of all the data and create 5K text files based on the business logic, so that the respective data lands in the respective files.

I have done this project using HBase. I created 35 tables in HBase, with region counts ranging from 10 to 500. My data sits in HDFS, and using MapReduce I bulk-load it into the respective HBase tables.
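
The driver follows the standard HFileOutputFormat2 bulk-load pattern; a minimal sketch (class names like BulkLoadDriver and XmlToPutMapper and the paths are placeholders, not my real code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BulkLoadDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "BulkLoadFundamentalAnalytic");
            job.setJarByClass(BulkLoadDriver.class);
            job.setMapperClass(XmlToPutMapper.class);               // placeholder mapper emitting Puts
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(Put.class);
            FileInputFormat.addInputPath(job, new Path("/data/raw"));      // assumed source dir
            FileOutputFormat.setOutputPath(job, new Path("/data/hfiles")); // HFile staging dir

            TableName name = TableName.valueOf("FundamentalAnalytic");
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(name);
                 RegionLocator locator = conn.getRegionLocator(name)) {
                // Sorts and partitions the mapper output so HFiles match region boundaries
                HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
            }
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The staged HFiles are then handed to the table with the LoadIncrementalHFiles (completebulkload) tool.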

After that, I have a SAX parser application written in Java that parses every incoming incremental XML file and updates the HBase tables. The files arrive at roughly 10 XML files per minute, with a total of 2,000 updates. The incremental messages are strictly ordered.
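
For illustration, a minimal sketch of such a handler (the id/op/value attributes and the cf column family are just examples, not my real schema):

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class IncrementalXmlHandler extends DefaultHandler {
        private final Table table;

        public IncrementalXmlHandler(Table table) {
            this.table = table;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes attrs) {
            String id = attrs.getValue("id");  // assumed row-key attribute
            if (id == null) return;            // skip container elements
            try {
                if ("delete".equals(attrs.getValue("op"))) {
                    table.delete(new Delete(Bytes.toBytes(id)));
                } else { // insert and update are both Puts in HBase
                    Put put = new Put(Bytes.toBytes(id));
                    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes(qName),
                            Bytes.toBytes(attrs.getValue("value")));
                    table.put(put);
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }

The handler is driven by a standard parser, e.g. SAXParserFactory.newInstance().newSAXParser().parse(xmlFile, handler); since each mutation keys on the row, applying the files strictly in order preserves the final state.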

Finally, on request, I run my last MapReduce application to scan all the HBase tables, create the 5K text files, and deliver them to the client.
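
The per-file routing in that job can be done with MultipleOutputs; a minimal sketch of the mapper side, where the cf:fileKey routing column is an assumption (and every row is assumed to carry it):

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class SnapshotExportMapper extends TableMapper<NullWritable, Text> {
        private MultipleOutputs<NullWritable, Text> out;

        @Override
        protected void setup(Context ctx) {
            out = new MultipleOutputs<>(ctx);
        }

        @Override
        protected void map(ImmutableBytesWritable row, Result result, Context ctx)
                throws IOException, InterruptedException {
            // Assumed routing column deciding which of the 5K files this row belongs to
            String fileKey = Bytes.toString(
                    result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("fileKey")));
            // The third argument becomes the base name of the output file for this record
            out.write(NullWritable.get(), new Text(Bytes.toString(result.getRow())), fileKey);
        }

        @Override
        protected void cleanup(Context ctx) throws IOException, InterruptedException {
            out.close(); // flushes all per-file writers
        }
    }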

All 3 steps work fine, but when I went to deploy the application on the production server, which is a shared cluster, the infrastructure team would not allow me to run it, because I do full table scans on HBase.

I have used a 94-node cluster, and my biggest HBase table holds approximately 2 billion records. All the other tables hold less than a million records each.

The total time for the MapReduce job to scan and create the text files is 2 hours.

Now I am looking for some other solution to implement this.

I cannot simply use Hive, because I have record-level inserts/updates/deletes, and they have to be applied in a very precise manner.

I have also integrated HBase and Hive tables, so that the HBase table is used for the incremental data and Hive is used for the full table scan. But because Hive goes through the HBase storage handler, I cannot create partitions on the Hive table, and that makes the Hive full table scan very, very slow, even 10 times slower than the HBase full table scan.

I can't think of any solution right now and I'm kind of stuck. Please help me with some other solution where HBase is not involved.

Can I use Avro or Parquet files in this use case? I am not sure how Avro would support record-level updates.


Solution

  • I will answer my own question. My issue is that I don't want to perform a full table scan on HBase, because it will impact the performance of the region servers; on a shared cluster especially, it will hit the read/write performance of HBase.

    So my solution keeps HBase, because it is very good for updates, especially delta updates, that is, column-level updates.

    So, in order to avoid the full table scan on the live table, I take a snapshot of the HBase table, export it to HDFS, and then run the full table scan against the snapshot.

    Here are the detailed steps for the process.

    Create a snapshot (from the HBase shell):

    snapshot 'FundamentalAnalytic','FundamentalAnalyticSnapshot'
    

    Export the snapshot to HDFS (here, to /tmp):

    hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot FundamentalAnalyticSnapshot -copy-to /tmp -mappers 16
    

    Driver job configuration to run MapReduce on the HBase snapshot:

    String snapshotName = "FundamentalAnalyticSnapshot";
    Path restoreDir = new Path("hdfs://quickstart.cloudera:8020/tmp");
    String hbaseRootDir = "hdfs://quickstart.cloudera:8020/hbase";
    
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.rootdir", hbaseRootDir); // point the job at the HBase root dir
    Job job = Job.getInstance(conf, "FundamentalAnalyticSnapshotScan");
    Scan scan = new Scan(); // narrow by column family/qualifier here if needed
    
    TableMapReduceUtil.initTableSnapshotMapperJob(
            snapshotName,        // snapshot name
            scan,                // Scan instance to control CF and attribute selection
            DefaultMapper.class, // mapper class
            NullWritable.class,  // mapper output key
            Text.class,          // mapper output value
            job,
            true,                // add HBase dependency jars to the job classpath
            restoreDir);         // directory where the snapshot is restored for reading
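
    The rest of the driver is standard MapReduce boilerplate; a minimal sketch, assuming the mapper emits plain text lines and an output directory of /tmp/snapshot-output:

    FileOutputFormat.setOutputPath(job, new Path("/tmp/snapshot-output")); // assumed output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1); // submit the job and wait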
    

    Also, running MapReduce on the HBase snapshot skips the scan against the live HBase table, so there is no impact on the region servers.