hadoop hbase hdfs hadoop2 bulk-load

Can I do cell merging in hbase?


Suppose I have a column that is updated incrementally rather than overwritten (like a bitwise-ORed integer flag or a sum column). For example (assuming only 1 version):

Existing Cell: [key: 'k1', f1:sum: 100]
Incoming New Cell: [key: 'k1', f1:sum: 200]

Then I want to update the cell data this way: sum = 100+200 = 300. Yielding the final record:
[key: 'k1', f1:sum: 300]

Here I want to MERGE the new cell into the old one with the same key. How can I achieve this? A direct put would simply overwrite the old cell. (Again, only one version is maintained.)

I came up with some ideas, but neither seems satisfying:

1> On the client side, first get the old value, then add it to the sum in the outgoing Put object.

2> Use a coprocessor. In RegionObserver.prePut I do a get, add the values, and modify the final Put object. This pushes the computation to the server side but still needs an extra query (get) first, which could be expensive.

Besides, even if the above works in a real-time query scenario, what about merging data during bulk load?

I've been going through the documentation for quite a while but can't find a clue yet. I'd really appreciate it if you could share some ideas on this.

I'm using hbase-1.2.6. Thanks!


Solution

  • If I understand your use case correctly and the values are long integers, then the HBase Increment operation should work for you. See the HBase 1.2.6 Javadoc for Increment for details.

    If it's not an arithmetic increment that you want, HBase also has an Append operation, which can be used to atomically append more data to an existing cell.

    Note that the Javadoc says Increments and Appends guarantee atomicity for writes but not for reads. That statement is incorrect: they do guarantee atomicity for reads too (since HBase 0.95), and the docs were fixed in later releases.

    Also, neither Increment nor Append does an extra Get RPC. Both work by taking a row lock on the server side and then doing a read followed by a write on the server under the same lock.
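
To make the "row lock, then read followed by write" idea concrete, here is a minimal in-memory sketch in Java. It is not HBase code and does not use the HBase API: the class `RowLockIncrementSketch`, its maps, and its `increment`/`get` methods are hypothetical names made up for illustration. It only models why a single server-side operation can merge 100 and 200 into 300 without the client issuing a separate get.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// In-memory model only (an assumption for illustration, NOT HBase internals):
// shows "lock row, read, add, write, unlock" happening in one place.
class RowLockIncrementSketch {
    // row key -> (column -> value); stands in for the cells stored in a region
    private final Map<String, Map<String, Long>> rows = new ConcurrentHashMap<>();
    // one lock per row key; stands in for the server-side row lock
    private final Map<String, ReentrantLock> rowLocks = new ConcurrentHashMap<>();

    // Adds 'amount' to the current value of the cell and returns the new value.
    long increment(String rowKey, String column, long amount) {
        ReentrantLock lock = rowLocks.computeIfAbsent(rowKey, k -> new ReentrantLock());
        lock.lock();
        try {
            Map<String, Long> cells =
                rows.computeIfAbsent(rowKey, k -> new ConcurrentHashMap<>());
            long current = cells.getOrDefault(column, 0L); // read under the lock
            long updated = current + amount;               // merge old and new
            cells.put(column, updated);                    // write under the same lock
            return updated;
        } finally {
            lock.unlock();
        }
    }

    // Plain read; a missing cell is treated as 0 in this sketch.
    long get(String rowKey, String column) {
        Map<String, Long> cells = rows.get(rowKey);
        return cells == null ? 0L : cells.getOrDefault(column, 0L);
    }

    public static void main(String[] args) {
        RowLockIncrementSketch store = new RowLockIncrementSketch();
        store.increment("k1", "f1:sum", 100L); // existing cell
        store.increment("k1", "f1:sum", 200L); // incoming cell is merged, not overwritten
        System.out.println(store.get("k1", "f1:sum")); // prints 300
    }
}
```

With the real HBase 1.2 client, the corresponding call would be a single `Table.increment(Increment)` or `Table.incrementColumnValue(row, family, qualifier, amount)`; the read-add-write sequence modeled above then runs on the region server rather than in client code.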