Tags: database, encryption, google-cloud-platform, hbase, bigtable

Avoiding hotspotting in BigTable or HBase by using SHA1 keys


I'm using Google BigTable to store event log data according to the following constraints:

  • Each key should contain a username and timestamp, allowing contiguous reads for time-series data on a per-user basis, like this: USERNAME_TIMESTAMP.
  • I will be storing up to 10,000,000 event logs or more per day, and so naturally, I need to avoid hotspotting and ensure that I am evenly distributing records across each node.
  • There is a massive security component to this database, and as such, I'd like to encrypt the username before using it as a key in BigTable.

Obviously, I'd like to avoid extra steps on every read or write, so I was thinking of hashing usernames with SHA1 (a one-way hash, strictly speaking, rather than encryption) before using them as the key prefix in BigTable. As a result, all keys in BigTable would be formatted like this:

cf23df2207d99a74fbe169e3eba035e633b65d94_2018_01_30_15090001
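In code, that key construction would look something like this (a minimal sketch; the helper name is hypothetical and the timestamp format is taken from the example above):

```python
import hashlib

def make_row_key(username: str, timestamp: str) -> str:
    # Hypothetical helper: prefix the row key with the SHA1 hex digest of the
    # username, so keys spread across the whole key space while all rows for
    # one user still share a 40-character prefix and sort contiguously.
    digest = hashlib.sha1(username.encode("utf-8")).hexdigest()
    return f"{digest}_{timestamp}"

key = make_row_key("alice", "2018_01_30_15090001")
```

A contiguous scan over one user's time series then becomes a prefix scan on `sha1(username) + "_"`.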

We know that SHA1 output is uniformly distributed, so given that, is it safe to assume that my records will be evenly distributed across nodes, while all rows for a given username still reside together? Will this in effect prevent hotspotting? Are there any edge cases in this approach that I've missed?


Solution

  • Assuming that the username is well distributed (i.e. no single user generates more than 10K operations per second), this approach should be fine.

    FYI, Cloud Bigtable measures operations in rows per second, and you should size for your peak throughput when determining the number of nodes. Each node can support 10,000 simple reads or writes per second. Our smallest production configuration is 3 nodes, which can support up to 30,000 rows per second (about 2.6 billion rows per day if used continuously at the maximum).
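    The capacity arithmetic above can be sketched as follows (the 10,000 rows/sec/node figure and 3-node minimum are taken from this answer; the peak rate used below is just an illustrative assumption):

    ```python
    import math

    ROWS_PER_NODE_PER_SEC = 10_000   # per-node throughput cited above
    MIN_NODES = 3                    # smallest production cluster cited above
    SECONDS_PER_DAY = 86_400

    # Daily ceiling of the minimum 3-node cluster:
    daily_capacity = MIN_NODES * ROWS_PER_NODE_PER_SEC * SECONDS_PER_DAY

    # Size for *peak* throughput, not the daily average: 10M events/day is only
    # ~116 rows/sec on average, but a hypothetical peak of 25,000 rows/sec still
    # fits within the minimum cluster.
    peak_rows_per_sec = 25_000
    nodes_needed = max(MIN_NODES, math.ceil(peak_rows_per_sec / ROWS_PER_NODE_PER_SEC))
    ```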