Search code examples
google-bigquerymaterialized-views

Materialized view for streaming data in Big Query


I have a a big table with Json records in Bigquery and that table is is getting updated with streaming data almost real time from external cloud database (firestore). I want to create another table by flattening the json record in rows and columns out of this big table. I am using DBT to do the transformation.

Every time I need updated data from the base table, I need to recreate the flattened table. I want my table to be updated in real time as well from base table. What would be the best option rather than creating as a table.

I read about views and materialized views but not sure what would be the best option to get real time streaming data from base table. Do I need to manually/auto refresh? and which one would be cost affective?

Thanks Ashish

materialized veiw but not sure if it needs to be auto refreshed in case of streaming data


Solution

  • Think of it in this way, when you query a view you are essentially running the SQL of the view every time you query the table, this enables you to present a 'table' which is always up to date.

    This would be ideally suited to transforming an underlying table that is updated in realtime, as every time you query it you will be seeing the latest data.

    The downside to views is that their performance can be slower, as data is not pre-aggregated and the view SQL must run before the table results can be returned.

    A materialized view is a hybrid of the two, you get the performance benefits of a table, which will be kept up to date as the underlying data refreshes. This would be ideal for a table that refreshes, for example, once an hour, meaning you don't need to build an additional process to handle it's refresh.

    In my opinion, it would not be suitable for a table which updates in realtime, as it would essentially be refreshing constantly and thus would cost more in query fees. If you use a view instead, the table will be rebuilt on demand each time you require the data.

    Hopefully the above helps you make the right choice depending on your use cases.