I'm designing an HBase schema with a row key that starts with the domain name reversed, e.g. com.example.www. Although there are many more domains that end in .com than, say, .org or .edu, I assume that I don't have to manage splitting myself and can rely on HBase's automatic splitting to distribute the rows across regions, i.e. regions will split as they get too large.
I should end up with more regions whose keys start with com. than, say, org., but I assume that's okay, and the "com. regions" should end up distributed across my region servers, correct?
Is there an issue with load balancing here? In the 2011 HBase Schema Design video by Lars (the link goes directly to the section of interest), he discusses a schema design that also has the reverse domain at the beginning of the key. The video says that an MD5 hash of the reverse domain was used "for load balancing reasons".
I'm probably missing something... If some.website.com is just as likely to appear in my input as another.website.org, doesn't that mean each row is just as likely to hit one region (and even one region server) as another?
HBase will normally split a region in two at its midpoint when it reaches hbase.hregion.max.filesize (depending on the split policy). You can rely on automatic splitting, but you'll end up with odd and lexically uneven split points because of the nature of your row keys (lots of "com" domains against few "org" domains).
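As a side note, both the split threshold and the split policy can also be overridden per table through the client API. A minimal sketch, assuming the HBase 2.x API and placeholder table/family names ("domains", "d"):

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.TableDescriptor;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

    public class SplitSettingsSketch {
        public static void main(String[] args) {
            // Per-table overrides of the cluster-wide defaults.
            TableDescriptor desc = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("domains"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"))
                // Equivalent of hbase.hregion.max.filesize, for this table only (10 GB here).
                .setMaxFileSize(10L * 1024 * 1024 * 1024)
                // The policy decides *when* a region splits; the split point itself
                // is still roughly the middle key of the region's data.
                .setRegionSplitPolicyClassName(
                    "org.apache.hadoop.hbase.regionserver.SteppingSplitPolicy")
                .build();
            System.out.println(desc);
        }
    }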
It may not be your exact case, but think of this potential issue: suppose the table has split into three regions, where Regions 1 and 2 store 40M rows each but Region 3 stores 65M rows (it would be split at 80M, but it may never reach that amount). Also, since you'd always be writing to the last region (even with batching enabled), the job would be a lot slower than issuing batches of writes to multiple regions at the same time.
Another problem: imagine you realize you also need to add .us domains (10M). Given this design they would all go to Region 3, increasing the number of rows it hosts to 75M.
The common approach to ensure an even distribution of keys among regions is to prepend a few chars of the MD5 of the key (in this case the domain name) to the row key. In HBase, the very first bytes of the row key determine the region that will host it.
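For example, a rough sketch of building such a salted key (the 2-char prefix length and the "|" separator are arbitrary choices of mine, not anything HBase requires):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class SaltedRowKey {

        // Builds a key like "a3|com.example.www": the first two hex chars of the
        // MD5 of the reversed domain, then the reversed domain itself so the key
        // stays readable and scannable within its bucket.
        public static byte[] rowKey(String reversedDomain) throws Exception {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(reversedDomain.getBytes(StandardCharsets.UTF_8));
            String prefix = String.format("%02x", digest[0] & 0xff); // 2 hex chars
            return (prefix + "|" + reversedDomain).getBytes(StandardCharsets.UTF_8);
        }

        public static void main(String[] args) throws Exception {
            System.out.println(new String(rowKey("com.example.www"), StandardCharsets.UTF_8));
        }
    }

The trade-off is that a range scan over all domains now has to fan out across the salt buckets, since consecutive domains no longer sort next to each other.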
Prepending just a few chars of the MD5 is enough to prevent hotspotting (one region getting too many writes) as much as possible and to get good automatic splits, but it's generally recommended to pre-split tables to ensure even better splitting.
If you prepend 2 chars of the MD5 to your row keys you can pre-split the table with 15 split points: "10", "20", "30" ... up to "f0". That will create 16 regions, and in case any of them needs to be automatically split it will be done at its midpoint, i.e. when the region covering "a0" through "af" reaches hbase.hregion.max.filesize it will be split at about "a8", and each of the resulting regions will store half of the "a" bucket.
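A sketch of creating such a pre-split table with the HBase 2.x client API (the table name "domains" and column family "d" are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitDomainsTable {
        public static void main(String[] args) throws Exception {
            // 15 split points -> 16 regions: [,"10"), ["10","20"), ... ["f0",)
            byte[][] splits = new byte[15][];
            for (int i = 1; i <= 15; i++) {
                splits[i - 1] = Bytes.toBytes(String.format("%x0", i)); // "10" .. "f0"
            }

            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Admin admin = connection.getAdmin()) {
                admin.createTable(
                    TableDescriptorBuilder.newBuilder(TableName.valueOf("domains"))
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"))
                        .build(),
                    splits);
            }
        }
    }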
This is an example of which regions would host each row if you have 16 pre-split regions and 2-char-prefixed row keys:
- Region 1 ---------
0b|com.example4.www
- Region 2 ---------
10|com.example.www
1b|org.example.www
- Region 6 ---------
56|com.example3.www
- Region 10 ---------
96|org.example5.www
- Region 11 ---------
af|com.example5.www
- Region 14 ---------
d5|org.example3.www
db|com.example2.www
de|org.example2.www
- Region 16 ---------
fb|org.example4.www
Given a lot more domains, it would end up being much more even, and almost all regions would store roughly the same number of domains.
In most cases having 8-16 pre-split regions will be more than enough, but if not, you can go for 32 or even 64 pre-split regions, up to a maximum of 256 (that would mean split points "01", "02", "03" ... "9f", "a0", "a1" ... up to "ff").
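If it helps, here's a small sketch that generates the 2-hex-char split points for any of those region counts (a hypothetical helper, not part of HBase itself; HBase also ships org.apache.hadoop.hbase.util.RegionSplitter with a HexStringSplit algorithm for the same purpose):

    import java.util.ArrayList;
    import java.util.List;

    public class SplitPointGenerator {

        // Evenly spaced 2-hex-char split points for numRegions regions.
        // numRegions = 16  -> "10", "20", ..., "f0"  (15 points)
        // numRegions = 256 -> "01", "02", ..., "ff"  (255 points)
        public static List<String> splitPoints(int numRegions) {
            if (numRegions < 2 || numRegions > 256 || 256 % numRegions != 0) {
                throw new IllegalArgumentException("use a power of two between 2 and 256");
            }
            List<String> points = new ArrayList<>();
            int step = 256 / numRegions;
            for (int i = step; i < 256; i += step) {
                points.add(String.format("%02x", i));
            }
            return points;
        }

        public static void main(String[] args) {
            System.out.println(splitPoints(16));
            System.out.println(splitPoints(64));
        }
    }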