
HBase Scan - RowKey Filters


Let me try to explain my issue briefly.

Imagine we have an HBase table that records every visit to a disco: each entry stores the disco's name, the visitor's name, and the day of the visit. (Yes, it's a dumb example, I know.)

So, for example, these would be some row keys in the table:

..
ministryOfSoundJamesOliver01022017
ministryOfSoundJamesOliver02022017
ministryOfSoundJamesOliver03022017
ministryOfSoundOliviaNewton04042017
ministryOfSoundOliviaNewton06042017
...
pachaibizaJohnMcKiness06042017
pachaibizaJohnMcKiness04042017
pachaibizaWilliamForrester04042017
..

The RowKey has the following structure:

discoName

personName

dayOfTheYear

(The table has some other columns/qualifiers, but they don't matter for this issue.)


The issue is: imagine a guy who simply loves going to Ministry Of Sound. He just loves it; he spends all his money on discos and drugs (but that's not the point here).

My goal is to output every person who attended Ministry Of Sound. In my scan, this dude keeps appearing in the results, so I must discard a lot of entries before reaching the next visitor. For example:

..
ministryOfSoundJohnnyYonkie01022017
ministryOfSoundJohnnyYonkie02022017
ministryOfSoundJohnnyYonkie03022017
ministryOfSoundJohnnyYonkie04022017
ministryOfSoundJohnnyYonkie05022017
ministryOfSoundAnotherDude02022017
...

In order to register AnotherDude, I must discard 4 entries from Johnny.
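
To make the problem concrete, the client-side discarding described above might look roughly like the sketch below. It is plain Python over a sorted list, not HBase client code, and it assumes that the key always ends with an 8-character date, that names contain only letters, and that no disco name is a prefix of another. The function name `distinct_visitors` is made up for illustration.

```python
from typing import Iterable, Iterator

DATE_LEN = 8  # assumption: every key ends with an 8-character ddMMyyyy date


def distinct_visitors(row_keys: Iterable[str], disco: str) -> Iterator[str]:
    """Yield each visitor of `disco` once, given row keys in sorted order."""
    previous = None
    for key in row_keys:
        if not key.startswith(disco):
            continue  # row belongs to another disco
        person = key[len(disco):-DATE_LEN]
        if person != previous:  # consecutive repeats of the same person are dropped
            yield person
            previous = person


# HBase returns rows in lexicographic key order, so the sample is sorted first.
rows = sorted([
    "ministryOfSoundJohnnyYonkie01022017",
    "ministryOfSoundJohnnyYonkie02022017",
    "ministryOfSoundJohnnyYonkie03022017",
    "ministryOfSoundJohnnyYonkie04022017",
    "ministryOfSoundJohnnyYonkie05022017",
    "ministryOfSoundAnotherDude02022017",
    "pachaibizaJohnMcKiness06042017",
])
print(list(distinct_visitors(rows, "ministryOfSound")))
# → ['AnotherDude', 'JohnnyYonkie']
```

Every one of Johnny's rows still travels to the client here, which is exactly the cost I would like to avoid.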

Finally, the question is:


Is there any way to tell HBase to automatically discard the repeated entries whose row keys match from byte(x) to byte(x+y) [ x being the number of bytes in discoName and y the number of bytes in personName ]?


Thanks a lot in advance!!


Solution

  • First things first: If you only have client access, I can't help you :(

    If you have additional access, then you could look at the following suggestions, but the default reply would be: if this is your access pattern, optimize your schema for it.

    If you need to access data in a certain way, make sure you write it that way in the first place. Use the MapReduce API if you have to perform migrations.

    I would probably just add a second table that holds one row per disco (e.g. ministryOfSound) and one column per visitor. (In general, the schema you propose doesn't sound very well suited to HBase: if post-processing the duplicate results away is really a performance issue, then you have a bunch of writes with monotonically increasing rowkeys.)

    On the other hand, if this is an ad-hoc query, then you probably want to use the MapReduce API straight away, maybe via the Apache Spark integration, and perform a "distinct" call on the data.
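
As a rough illustration of that extra table, here is a minimal in-memory sketch in plain Python. The dictionary stands in for the second HBase table; `visitors_by_disco`, `record_visit`, and `visitors` are hypothetical names, and storing the latest visit date as the cell value is an assumption, not something the schema requires.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical stand-in for a second table:
# row key = disco name, column qualifier = visitor name, value = latest visit date.
visitors_by_disco: Dict[str, Dict[str, str]] = defaultdict(dict)


def record_visit(disco: str, person: str, day: str) -> None:
    """Write the visit to the lookup table; repeat visits overwrite the same cell."""
    visitors_by_disco[disco][person] = day


def visitors(disco: str) -> List[str]:
    """Read one row and return its column qualifiers (the visitors), sorted."""
    return sorted(visitors_by_disco.get(disco, {}))


for day in ("01022017", "02022017", "03022017", "04022017", "05022017"):
    record_visit("ministryOfSound", "JohnnyYonkie", day)
record_visit("ministryOfSound", "AnotherDude", "02022017")

print(visitors("ministryOfSound"))
# → ['AnotherDude', 'JohnnyYonkie']
```

Because a repeat visit lands on the same cell, the duplicates are eliminated at write time and the read becomes a single-row lookup.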

    Using Scans for analytical queries isn't how I would do it.

    If you had to do it using Scans, then I would recommend implementing a Coprocessor. Coprocessors can augment a Filter with state, and you can project the results of a PrefixFilter'd Scan on the Region Server side. If you're new to Coprocessors, here's an introduction: HBase: The Definitive Guide. This requires that you can deploy jars into the RegionServer classpath.
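
The idea behind such server-side state can be sketched as "emit the first row of a visitor, then seek past the rest of that visitor's rows". The block below simulates this in plain Python with `bisect` over a sorted list; it is not HBase code. It assumes names contain only letters and the date is all digits, so `disco + person + ":"` (`:` being the character right after `9` in ASCII) sorts after all of that visitor's rows and before any other visitor. In HBase itself, filters can ask the scanner to jump ahead via the `SEEK_NEXT_USING_HINT` return code, which is the closest real counterpart I know of to the seek step shown here.

```python
import bisect
from typing import List, Tuple

DATE_LEN = 8  # assumption: every key ends with an 8-digit ddMMyyyy date


def distinct_visitors_by_seeking(sorted_keys: List[str], disco: str) -> Tuple[List[str], int]:
    """Return (visitors, rows_read), reading one row per visitor and seeking past the rest."""
    found: List[str] = []
    rows_read = 0
    i = bisect.bisect_left(sorted_keys, disco)  # seek to the first row of this disco
    while i < len(sorted_keys) and sorted_keys[i].startswith(disco):
        rows_read += 1
        person = sorted_keys[i][len(disco):-DATE_LEN]
        found.append(person)
        # Seek target: sorts after "<disco><person><digits>" but before
        # "<disco><person><letter>...", given the letters-only assumption.
        i = bisect.bisect_left(sorted_keys, disco + person + ":")
    return found, rows_read


keys = sorted([
    "ministryOfSoundJamesOliver01022017",
    "ministryOfSoundJamesOliver02022017",
    "ministryOfSoundJamesOliver03022017",
    "ministryOfSoundJohnnyYonkie01022017",
    "ministryOfSoundJohnnyYonkie02022017",
    "ministryOfSoundJohnnyYonkie03022017",
    "ministryOfSoundJohnnyYonkie04022017",
    "ministryOfSoundJohnnyYonkie05022017",
    "ministryOfSoundAnotherDude02022017",
    "pachaibizaJohnMcKiness04042017",
    "pachaibizaJohnMcKiness06042017",
])
names, rows_read = distinct_visitors_by_seeking(keys, "ministryOfSound")
print(names, rows_read)
# → ['AnotherDude', 'JamesOliver', 'JohnnyYonkie'] 3
```

Only three of the nine Ministry Of Sound rows are touched in this simulation; how much a real seek saves depends on how many rows each visitor has.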

    But again, if you blow up your client by doing a distinct filtering there, you're probably also blowing up your regions due to hotspots on the inserts.

    As a final alternative: you might want to look at Apache Phoenix and see if you can coerce your rowkey into a schema from which you can do a select distinct on the first two parts of the rowkey. This would obviously require that you have a delimiter in your rowkey, or at least fixed-length fields.
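
To show why the delimiter matters, here is a small sketch of a delimited rowkey in plain Python. The `|` delimiter and the helper names `build_row_key` and `parse_row_key` are assumptions made for illustration; the only requirement is that the delimiter can never appear inside a disco or person name.

```python
from typing import Tuple

DELIMITER = "|"  # assumption: never occurs in disco or person names


def build_row_key(disco: str, person: str, day: str) -> str:
    """Join the three parts with the delimiter, rejecting parts that contain it."""
    for part in (disco, person, day):
        if DELIMITER in part:
            raise ValueError(f"part contains the delimiter: {part!r}")
    return DELIMITER.join((disco, person, day))


def parse_row_key(key: str) -> Tuple[str, str, str]:
    """Split a key back into (disco, person, day)."""
    disco, person, day = key.split(DELIMITER)
    return disco, person, day


keys = [
    build_row_key("ministryOfSound", "JohnnyYonkie", "01022017"),
    build_row_key("ministryOfSound", "JohnnyYonkie", "02022017"),
    build_row_key("ministryOfSound", "AnotherDude", "02022017"),
]
# Distinct on the first two parts of the key.
pairs = sorted({parse_row_key(k)[:2] for k in keys})
print(pairs)
# → [('ministryOfSound', 'AnotherDude'), ('ministryOfSound', 'JohnnyYonkie')]
```

With the original concatenated keys there is no reliable way to tell where discoName ends and personName begins, which is why the parts cannot be mapped to separate columns without a delimiter or fixed widths.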