java hadoop mapreduce hbase hadoop2

Querying HBase efficiently


I'm using Java as a client for querying HBase.

My HBase table is set up like this:

ROWKEY     |     HOST     |     EVENT
-----------|--------------|----------
21_1465435 | host.hst.com |  clicked
22_1463456 | hlo.wrld.com |  dragged
    .             .             .
    .             .             .
    .             .             .

The first thing I need to do is get a list of all ROWKEYs that have host.hst.com associated with them.

I can create a scanner on the HOST column and, for each row whose HOST value is host.hst.com, add the corresponding ROWKEY to the list. That seems reasonably efficient: O(n) to go through all rows.

Now is the hard part. For each ROWKEY in the list, I need to get the corresponding EVENT.

If I use a normal GET command to get the cell at (ROWKEY, EVENT), I believe a scanner is created on EVENT that takes O(n) time to find the correct cell and return the value. That is pretty bad time complexity for each individual ROWKEY, and combining the two steps gives O(n^2).

Is there a more efficient way of going about this?

Thanks a lot for any help in advance!


Solution

  • What is your n here? With the rowkey in hand (I presume you mean the HBase rowkey, not some handcrafted one), the lookup is fast and easy for HBase. Consider that to be effectively O(1).

    If instead the ROWKEY is an actual column you created, then that is your issue. Use the HBase-provided rowkey instead.

    So let's move on, assuming you either (a) already use the HBase-provided rowkey properly or (b) have fixed your structure to do so.

    In that case you can simply create a separate get for each (rowkey, EVENT) value as follows:

    Perform a `get` with the given `rowkey`. 
    In your result then filter out EVENT in <yourEventValues for that rowkey>
    

    So you will end up fetching all recent (latest timestamp) entries for the given rowkey. This is presumably small compared to n. The filtering is then a fast operation on one column.

    You can also speed this up by doing a batched multiget. The savings come from fewer round trips: the client sends the gets grouped per region server instead of issuing one request per row. (Reads are served by the region servers, not by the HBase master.)
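To make the cost argument concrete, here is a minimal in-memory model. It is not HBase client code: a `TreeMap` stands in for a table sorted by rowkey, and the names `RowLookupSketch`, `getEvent`, and `getEvents` are made up for illustration. It shows that reading EVENT for a known rowkey is a keyed lookup rather than a scan over the column, and that a list of rowkeys can be resolved in one call. In the real Java client, the batched form corresponds to passing a list of `Get` objects to `Table.get`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for an HBase table; illustration only.
public class RowLookupSketch {
    // rowkey -> (column name -> value), kept sorted by rowkey
    private final TreeMap<String, Map<String, String>> rows = new TreeMap<>();

    public void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new HashMap<>()).put(column, value);
    }

    // Single "get": keyed lookup on the rowkey, then pick one column of that row.
    public String getEvent(String rowKey) {
        Map<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get("EVENT");
    }

    // Batched "multiget": resolve many rowkeys in one call, in input order.
    public List<String> getEvents(List<String> rowKeys) {
        List<String> events = new ArrayList<>();
        for (String rowKey : rowKeys) {
            events.add(getEvent(rowKey));
        }
        return events;
    }

    public static void main(String[] args) {
        RowLookupSketch table = new RowLookupSketch();
        table.put("21_1465435", "HOST", "host.hst.com");
        table.put("21_1465435", "EVENT", "clicked");
        table.put("22_1463456", "HOST", "hlo.wrld.com");
        table.put("22_1463456", "EVENT", "dragged");
        System.out.println(table.getEvents(Arrays.asList("21_1465435", "22_1463456")));
    }
}
```

Each lookup in this model costs a tree descent, not a pass over every row, which is why the per-rowkey step does not multiply the work by n.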

    Update: Thanks to the OP, I understand the situation more clearly. I suggest simply using the "host | " as the rowkey. Then you can do a range scan and obtain the entries from a single Get / Scan.

    Another update

    HBase supports range scans based on prefixes of the rowkey. So if you have foobarRow1, foobarRow2, etc., you can do a range scan on (foobarRow, foobarRowz) and it will find all of the rows whose rowkeys start with foobarRow, followed by any alphanumeric characters.

    Take a look at this: "HBase (Easy): How to Perform Range Prefix Scan in hbase shell".

    Here is some illustrative code:

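    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.CompareFilter;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    // Keep only rows whose columnfamily:storenumber value is not 15
    SingleColumnValueFilter filter = new SingleColumnValueFilter(
       Bytes.toBytes("columnfamily"),
       Bytes.toBytes("storenumber"),
       CompareFilter.CompareOp.NOT_EQUAL,
       Bytes.toBytes(15)
    );
    // Skip rows that do not have the storenumber column at all
    filter.setFilterIfMissing(true);
    // Restrict the scan to a rowkey range instead of the whole table
    Scan scan = new Scan(
       Bytes.toBytes("20110103-1"),
       Bytes.toBytes("20110105-1")
    );
    scan.setFilter(filter);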
    

    Notice that 20110103-1 and 20110105-1 give the range of rowkeys to scan: the start row is inclusive and the stop row is exclusive.
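The prefix range scan idea can also be modeled without a cluster. The sketch below is an in-memory illustration only, not HBase client code: a `TreeMap` plays the role of the rowkey-sorted table, and the class name `PrefixScanSketch`, the `host|id` key layout, and the third sample row are assumptions made for the example. Because keys are sorted, all rows sharing a host prefix sit next to each other, so the scan seeks to the first matching key and stops at the first key that no longer matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a table whose rowkey starts with the host; illustration only.
public class PrefixScanSketch {
    // rowkey -> EVENT value, sorted by rowkey
    private final TreeMap<String, String> rows = new TreeMap<>();

    public void put(String rowKey, String event) {
        rows.put(rowKey, event);
    }

    // Seek to the first key >= prefix, then read until the prefix stops matching.
    public List<String> scanPrefix(String prefix) {
        List<String> events = new ArrayList<>();
        for (Map.Entry<String, String> e : rows.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // sorted order guarantees no later key can match
            }
            events.add(e.getValue());
        }
        return events;
    }

    public static void main(String[] args) {
        PrefixScanSketch table = new PrefixScanSketch();
        // Assumed key layout: host, a '|' separator, then the original id
        table.put("host.hst.com|21_1465435", "clicked");
        table.put("hlo.wrld.com|22_1463456", "dragged");
        table.put("host.hst.com|23_1469999", "scrolled"); // made-up row
        System.out.println(table.scanPrefix("host.hst.com|"));
    }
}
```

With this layout, the two steps from the question (find the rowkeys for a host, then fetch each EVENT) collapse into one pass over a contiguous slice of the table.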