Tags: google-bigquery, immutability, sharding

BigQuery table design for immutable data


Background

We're probably going to use BigQuery to store our immutable business events so that we can replay them later to other services. One approach would be to store each event as a blob (plus some metadata). To replay events easily, it would of course be nice to maintain a global order and persist every event to the same table in BigQuery. We have something like 10 events per second, which is nowhere near the limit of 100,000 messages per second.

Question

  1. Would it be ok to simply persist all events in the same table?
  2. Would it perhaps be better to shard messages in different tables (perhaps based on event type, topic or date)?
  3. If (2), is it possible to join/scan through multiple tables sorted by time so that it's possible to replay events in the same order?

Solution

  • If your primary usage scenario is to store events and then replay them, there is no reason to split different event types into different tables, especially since each event is an opaque blob. Keeping them all in the same table also has the small benefit of letting you analyze events by type and other metadata. Sharding by day does make sense, especially if you will mostly be querying recent data; this helps keep BigQuery query costs down.
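    A minimal sketch of what day-sharding might look like, assuming a hypothetical `events` base name and `mydataset` dataset: one table per day (e.g. `events_20240115`), which BigQuery can address collectively through a wildcard table (`events_*`) while pruning shards with the `_TABLE_SUFFIX` pseudo-column, so a query only pays for the days it actually touches.

    ```python
    from datetime import date

    def shard_name(day: date, base: str = "events") -> str:
        """Name of the daily shard table for a given date, e.g. events_20240115."""
        return f"{base}_{day.strftime('%Y%m%d')}"

    def replay_query(start: date, end: date, dataset: str = "mydataset") -> str:
        """Standard SQL that scans only the shards between start and end (inclusive)."""
        return (
            f"SELECT event_type, payload, event_timestamp\n"
            f"FROM `{dataset}.events_*`\n"
            f"WHERE _TABLE_SUFFIX BETWEEN "
            f"'{start.strftime('%Y%m%d')}' AND '{end.strftime('%Y%m%d')}'"
        )

    print(shard_name(date(2024, 1, 15)))                      # events_20240115
    print(replay_query(date(2024, 1, 1), date(2024, 1, 7)))
    ```

    The column names here (`event_type`, `payload`, `event_timestamp`) are illustrative; any schema with the event blob plus metadata works the same way.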

    But I am worried about your requirement of replaying events in order. There is no clustered index in BigQuery, so every time you need to replay your events you will have to use "ORDER BY timestamp" in your query, and a top-level ORDER BY only scales to a relatively small amount of data (tens of megabytes). So if you want to replay a large number of events, this design won't work for you.
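    One workaround sketch, building on the day-sharding idea above (table and column names are assumptions, not from the original): instead of one global ORDER BY over all events, replay one daily shard at a time. Each per-shard ORDER BY only has to sort a single day's worth of events, which keeps the sort small, and consuming shards in date order preserves the global ordering.

    ```python
    from datetime import date, timedelta

    def daily_replay_queries(start: date, end: date, dataset: str = "mydataset"):
        """Yield one ordered query per daily shard, oldest day first.

        Replaying the results of each query in turn reproduces the global
        event order without a single large ORDER BY over all shards.
        """
        day = start
        while day <= end:
            yield (
                f"SELECT event_type, payload, event_timestamp\n"
                f"FROM `{dataset}.events_{day.strftime('%Y%m%d')}`\n"
                f"ORDER BY event_timestamp"
            )
            day += timedelta(days=1)

    queries = list(daily_replay_queries(date(2024, 1, 1), date(2024, 1, 3)))
    print(len(queries))  # one query per day: 3
    ```

    Each query would then be run (for example with the `google-cloud-bigquery` client) and its rows streamed to the downstream service before moving on to the next day.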