hive, hbase, apache-pig

Updating HBase table only if value is different


I am loading data into HBase via Pig. The Pig script runs daily, looks for updated records in various Hive tables, performs joins and processing, then loads the result into HBase. The problem I'm having is that sometimes one part of a record is updated but other parts are not.

Example: a record with key abcd123 exists in Hive table 1 and Hive table 2. Hive table 1 has new data for it, but Hive table 2 does not. My Pig script joins both tables and then loads the joined record into HBase, updating the existing record in HBase for key abcd123.

Is there a way to have HBase check whether the data currently stored for the key differs from what the Pig script is attempting to load, and then accept the write only for the values that differ? There is no point in rewriting the row with lots of data that hasn't changed just to get in the one value that has.


Solution

  • You need to build a custom solution to achieve this. Two approaches you could try:

    Approach 1: Maintain two copies of your dataset, one in Hive/Pig and one in HBase, and always keep them in sync. Whenever new changes arrive (in table 1 or table 2), join table 1 and table 2, then compare the old dataset with the newly created one to find the delta records that need to be updated. Write only those delta changes to HBase.

    Approach 2: Add a date column to both tables (table 1 and table 2) so that changed rows can be identified. Join both tables and use the date column to find the delta changes (add a comment if you need more details).
    For every delta record, query HBase by row key and compare the new values with the values stored in HBase; if something has changed, update the values for that row key in HBase.
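
The comparison step that both approaches rely on can be sketched as follows. This is only an illustration in plain Python, not Pig or HBase client code: the dictionaries stand in for the stored data (the HBase row or the previous joined snapshot) and for today's joined output. The names `changed_columns` and `compute_delta` and the column names `t1:name` and `t2:city` are hypothetical.

```python
from typing import Dict, Optional

Row = Dict[str, str]  # column name -> value for one row key


def changed_columns(new_row: Row, old_row: Optional[Row]) -> Row:
    """Return only the columns whose value differs from what is stored."""
    if old_row is None:
        # Key not stored yet: every column has to be written.
        return dict(new_row)
    return {col: val for col, val in new_row.items() if old_row.get(col) != val}


def compute_delta(new_data: Dict[str, Row], old_data: Dict[str, Row]) -> Dict[str, Row]:
    """Map each row key to the columns that need writing; unchanged rows are left out."""
    delta = {}
    for key, new_row in new_data.items():
        diff = changed_columns(new_row, old_data.get(key))
        if diff:
            delta[key] = diff
    return delta


# Stand-in for what is currently stored (hypothetical data).
stored = {"abcd123": {"t1:name": "old", "t2:city": "Paris"}}

# Stand-in for today's joined output of table 1 and table 2 (hypothetical data).
joined = {
    "abcd123": {"t1:name": "new", "t2:city": "Paris"},
    "efgh456": {"t1:name": "x", "t2:city": "Rome"},
}

delta = compute_delta(joined, stored)
print(delta)
```

In a real job, each entry in `delta` would become one write to HBase containing only the listed columns. How the stored values are obtained (a lookup per row key as in Approach 2, or a snapshot kept alongside the Hive tables as in Approach 1) is left open here.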