time-series key google-cloud-bigtable bigtable

Does putting multiple time-series in one Bigtable table avoid hotspotting?

Say I have 3 unrelated time-series. Each written row key starts with the current timestamp: timestamp#.....

Having each time-series in a separate table will cause hotspotting because new rows are always added at one extremity (latest timestamp).

If we combine all 3 time-series in one Bigtable table with prefixes:

series1#timestamp#....
series2#timestamp#....
series3#timestamp#....

Does that avoid hotspotting? Will each cluster node handle one time-series?

I'm assuming that there are 3 nodes per cluster and that each of the 3 time-series will receive similar load and will grow in size evenly.

If yes, is there any disadvantage to having multiple unrelated time-series in one Bigtable table?

Solution

Because you have a timestamp as the first part of your rowkey, I think you're going to get hotspots either way.

In a Bigtable instance, your data is split into groups of contiguous rowkeys (called tablets), and those tablets are distributed evenly across the nodes. To get the most out of Bigtable, writes need to be spread across the nodes and across the tablets within each node. You get hotspotting when you keep writing to the same row or to a contiguous set of rows, since all of that work lands in one tablet. If the timestamp is a prominent part of the key, you will keep writing to the same tablet until it fills up and then move on to the next one, rather than writing to multiple tablets within a node.

The Bigtable documentation has a guide for time-series schema design which recommends a few solutions for a use case like yours:

Field promotion: add an additional field to the rowkey before your timestamp to separate out a group of data (USER_ID#timestamp#...)
Salting: take a hash of the timestamp, divide it by the number of nodes, and add the remainder to the front of the rowkey (SALT_RESULT#timestamp#...)
Reverse timestamps: if neither of those works, reverse the timestamp. This works best if your most common query is for the latest values, but it can make other queries more difficult
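
To make the three options concrete, here is a minimal Python sketch that only builds the rowkey strings; it does not call the Bigtable client. The function names, NUM_BUCKETS, MAX_TS and the use of CRC32 as the hash are all assumptions for illustration, not something prescribed by the guide.

```python
import zlib

NUM_BUCKETS = 3    # assumption: one salt bucket per node
MAX_TS = 10**13    # assumption: upper bound for millisecond timestamps


def promoted_key(series_id: str, ts_ms: int) -> str:
    """Field promotion: put an identifying field before the timestamp."""
    return f"{series_id}#{ts_ms}"


def salted_key(ts_ms: int, buckets: int = NUM_BUCKETS) -> str:
    """Salting: prefix the key with hash(timestamp) modulo the bucket count."""
    salt = zlib.crc32(str(ts_ms).encode()) % buckets
    return f"{salt}#{ts_ms}"


def reversed_key(series_id: str, ts_ms: int) -> str:
    """Reverse timestamp: newer rows get smaller values, so they sort first."""
    # Zero-pad so that lexicographic order matches numeric order.
    return f"{series_id}#{MAX_TS - ts_ms:013d}"
```

With salting, a reader has to query every bucket prefix and merge the results, which is part of why it is usually a fallback rather than the first choice.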

Edit: Your approach is definitely similar to salting, but since your data is already in separate tables you're actually not getting any increased benefit since the hotspotting is going to be caused at the tablet level.

To draw it out more, let's say you have this data in separate tables and start writing data. Each table is going to be composed of tablets, which capture timestamps 0-10, 11-20, etc... Those tablets will automatically be distributed amongst nodes for the best performance. If the loads are all similar, tablets 0-10 should all be on separate nodes, 11-20 will all be on separate nodes etc.

With the way your schema is set up, you are constantly writing to the latest tablet. Let's say the time is now 91: you're only writing to the 91-100 tablet while ignoring all the other tablets within that node. Since that 91-100 tablet is the only one doing any work, your node isn't going to give you optimized performance, and this is what we refer to as hotspotting. A certain tablet is getting a spike, but there won't be enough time for the load balancer to correct it.

If you have it in the same table, we can just focus on one node now. series1#0-10 will first get slammed, then series1#11-20, then series1#21-30. There is always one tablet that is getting too much load and not making use of the full node.
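
As a rough illustration of the single-table case, the toy model below counts which tablets receive writes during a short window. It assumes fixed tablets of 10 timestamp units each (TABLET_SPAN, tablet_for and writes_per_tablet are hypothetical names); real Bigtable tablets are split and moved dynamically, so this is only a sketch of the access pattern.

```python
from collections import Counter

TABLET_SPAN = 10  # assumption: each tablet covers 10 timestamp units


def tablet_for(series: str, ts: int) -> str:
    """Return a label for the toy tablet that holds series#ts."""
    start = (ts // TABLET_SPAN) * TABLET_SPAN
    return f"{series}#{start}-{start + TABLET_SPAN - 1}"


def writes_per_tablet(series_names, timestamps) -> Counter:
    """Count writes per tablet when every series writes once per timestamp."""
    counts = Counter()
    for ts in timestamps:
        for series in series_names:
            counts[tablet_for(series, ts)] += 1
    return counts


if __name__ == "__main__":
    window = range(91, 96)  # the "current" timestamps being written
    for tablet, n in writes_per_tablet(["series1", "series2", "series3"], window).items():
        print(tablet, n)
```

In this model only the newest tablet of each series is ever touched during the window; every older tablet of that series sits idle.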

There is some more information about load balancing in the documentation.