google-cloud-bigtable, bigtable

Which Bigtable schema performs better: a single column with multiple cells vs. multiple columns with a single cell?


I need to store user interactions for 7 days in an existing Bigtable table whose row key is the user identifier. There are two types of interaction, and we need to be able to retrieve each user's interaction history in time order. The column family should clearly have a 7-day TTL, and the column should encode the interaction type. I'm considering two options for the column: {interaction_type}:{timestamp}, keeping only the latest cell, or {interaction_type} with multiple cells. Since the GCP Bigtable documentation recommends against having too many columns in a row, the latter looks more reasonable. However, these columns have to be retrieved together with other existing columns that follow the former schema (timestamp in the column, latest cell only). If I choose the latter, the query has to use interleaved filters, because the columns need different numbers of cells. So I'd like to know which option gives better read performance. I'd also like to understand the performance implications in Bigtable of one column with multiple cells vs. multiple columns with one cell each, and of chain filters vs. interleaved filters.
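To make the two options concrete, here is a minimal sketch that models a single row in plain Python. It is a toy in-memory model, not the Bigtable client library; the function names and the sample event types (`click`, `view`) are hypothetical and exist only for illustration.

```python
# Toy in-memory model of one Bigtable row; this is NOT the client library.
# A row maps column qualifier -> list of (timestamp, value) cells, newest first.

def write_per_event_column(row, interaction_type, ts, value):
    """Option 1: column `{interaction_type}:{timestamp}`, a single cell each."""
    row[f"{interaction_type}:{ts}"] = [(ts, value)]

def write_per_type_column(row, interaction_type, ts, value):
    """Option 2: column `{interaction_type}`, every event is one more cell."""
    cells = row.setdefault(interaction_type, [])
    cells.append((ts, value))
    cells.sort(reverse=True)  # keep the newest cell first

def history(row):
    """Return (timestamp, interaction_type, value) tuples, oldest first."""
    events = [
        (ts, qualifier.split(":")[0], value)
        for qualifier, cells in row.items()
        for ts, value in cells
    ]
    return sorted(events)

# Hypothetical sample events: (interaction_type, timestamp, value).
events = [("click", 100, "a"), ("view", 150, "b"), ("click", 200, "c")]
row_a, row_b = {}, {}
for kind, ts, value in events:
    write_per_event_column(row_a, kind, ts, value)
    write_per_type_column(row_b, kind, ts, value)

print(len(row_a), len(row_b))            # 3 2
print(history(row_a) == history(row_b))  # True
```

Both layouts hold the same history. The difference is that option 1 grows the number of columns with every event, while option 2 keeps two columns and grows the number of cells per column.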


Solution

  • What you are describing is covered in https://cloud.google.com/bigtable/docs/schema-design#row-keys. From what you have stated, the question comes down to how many columns you design into each row. In general, interleaving carries a performance penalty, and such queries result in further fetches.

    The best design is to determine the smallest data set that is usable, i.e. combine elements into one column so that it holds all the fields needed for a result without requiring an additional column query. This has to be weighed against storing common elements only once, i.e. not storing the same field content in multiple columns (which uses more space). There are times when duplication is the better choice, e.g. when it lets a query return a particular column without processing another column (which can be faster).

    The second option is definitely better, although the answer ultimately depends on your access patterns. Judged purely on performance, avoiding the interleaved filters would be better.

    Another consideration for your scenario is the cells per column limit filter: https://cloud.google.com/bigtable/docs/using-filters#cells-per-column-limit

    The overhead of interleave filters is mentioned here: https://cloud.google.com/bigtable/docs/using-filters#interleave
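To illustrate why the mixed read needs an interleave, here is a minimal sketch of the filter semantics in plain Python. It is a toy model and not the client library: the family names `profile` and `interactions`, the sample cells, and all helper functions are hypothetical. In the Python client, the corresponding building blocks are, to my knowledge, `RowFilterChain`, `RowFilterUnion` and `CellsColumnLimitFilter` in `google.cloud.bigtable.row_filters`.

```python
# Toy model of Bigtable filter semantics; this is NOT the client library.
# A cell is (family, qualifier, timestamp, value); within a column the
# cells are assumed to be listed newest first.

def family(name):
    """Keep only the cells of one column family."""
    return lambda cells: [c for c in cells if c[0] == name]

def cells_per_column_limit(n):
    """Keep at most the n newest cells of every column."""
    def apply(cells):
        seen, out = {}, []
        for c in cells:
            key = (c[0], c[1])
            seen[key] = seen.get(key, 0) + 1
            if seen[key] <= n:
                out.append(c)
        return out
    return apply

def chain(*filters):
    """Chain: the output of each filter is the input of the next one."""
    def apply(cells):
        for f in filters:
            cells = f(cells)
        return cells
    return apply

def interleave(*filters):
    """Interleave: every filter sees the whole row; the outputs are merged."""
    def apply(cells):
        merged = [c for f in filters for c in f(cells)]
        return sorted(merged, key=lambda c: (c[0], c[1], -c[2]))
    return apply

# Hypothetical row: an existing "latest cell only" column plus history cells.
row = [
    ("profile", "name", 300, "Ann"),
    ("profile", "name", 100, "Anne"),
    ("interactions", "click", 250, "c2"),
    ("interactions", "click", 200, "c1"),
    ("interactions", "view", 220, "v1"),
]

# Latest cell for the existing columns, every cell for the interactions.
mixed_read = interleave(
    chain(family("profile"), cells_per_column_limit(1)),
    family("interactions"),
)
result = mixed_read(row)
print(len(result))  # 4

# A single limit over the whole row would drop interaction history.
print(len(cells_per_column_limit(1)(row)))  # 3
```

In this model the interleave runs two filter branches over the same row and merges their results, which is the extra work the linked documentation refers to. A single chain can only apply one cells-per-column limit to everything it returns.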