Tags: google-bigquery, google-cloud-dataflow, airflow, google-cloud-sql, apache-beam

Dumping a materialized aggregation from BigQuery to SQL Server: Dataflow vs. Airflow


I use a BigQuery dataset as a data lake to store all record/event-level data, and a SQL Server instance to store aggregated reports that are updated regularly. The reports are accessed frequently by clients via a web interface, and each report aggregates a large amount of data, so serving them straight from BigQuery is a no-go.

What is the best practice for doing this? Internally we have two ideas in circulation:

  1. Run a Dataflow batch job every X hours to recalculate the aggregation and update the SQL server. It needs a scheduler to trigger the job, and the same job can be reused to backfill all historical data (see the Beam sketch after this list).

  2. Run an Airflow job that does the same thing. A separate job would be needed for backfills (but it can still share most of its code with the regular job).
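For the Dataflow option, a minimal Apache Beam (Python) sketch might look like the following. The project, dataset, table names, connection string, and credentials are all placeholders, and it assumes Beam's cross-language JdbcIO, which needs Java and the Microsoft JDBC driver available to the expansion service:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.jdbc import WriteToJdbc

# Hypothetical aggregation query over the event-level data.
AGG_QUERY = """
SELECT report_date, client_id, SUM(amount) AS total
FROM `my_project.my_dataset.events`
GROUP BY report_date, client_id
"""

def run():
    options = PipelineOptions(
        runner='DataflowRunner',
        project='my_project',
        region='us-central1',
        temp_location='gs://my_bucket/tmp',
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            # Recompute the aggregation from the raw events in BigQuery.
            | 'ReadAggregates' >> beam.io.ReadFromBigQuery(
                query=AGG_QUERY, use_standard_sql=True)
            # Convert dicts to schema'd rows, since JdbcIO needs a schema.
            | 'ToRows' >> beam.Map(lambda r: beam.Row(
                report_date=str(r['report_date']),
                client_id=int(r['client_id']),
                total=float(r['total'])))
            # Write into SQL Server; all connection details are placeholders.
            | 'WriteToSqlServer' >> WriteToJdbc(
                table_name='dbo.agg_report',
                driver_class_name='com.microsoft.sqlserver.jdbc.SQLServerDriver',
                jdbc_url='jdbc:sqlserver://my-sql-host:1433;databaseName=reports',
                username='USER',
                password='PASSWORD',
                # Driver coordinates for the expansion service (an assumption;
                # pin whatever mssql-jdbc version you actually use).
                classpath=['com.microsoft.sqlserver:mssql-jdbc:9.4.1.jre8'],
            )
        )

if __name__ == '__main__':
    run()
```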

I know Dataflow does well at processing chunks of data in parallel, but I wonder about Airflow's performance, as well as the risk of exhausting the SQL Server connection limit.


Solution

  • Please check this answer from a previous, similar question.

    In conclusion: using Airflow gives you a more efficient way to manage the whole workflow from end to end. Google's managed offering based on Airflow is Cloud Composer.
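For concreteness, here is a minimal sketch of what such a scheduled Airflow job could look like. The DAG id, schedule, query, connection string, and the full-refresh (truncate-and-insert) strategy are all assumptions, not a prescribed implementation:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def bq_to_sql_server(**_):
    """Recompute the aggregation in BigQuery and load it into SQL Server."""
    from google.cloud import bigquery
    import pyodbc

    bq = bigquery.Client()
    # Hypothetical aggregation query over the event-level data.
    rows = bq.query(
        """
        SELECT report_date, client_id, SUM(amount) AS total
        FROM `my_project.my_dataset.events`
        GROUP BY report_date, client_id
        """
    ).result()

    # Placeholder connection string; in practice pull it from an Airflow
    # Connection rather than hard-coding credentials.
    conn = pyodbc.connect(
        'DRIVER={ODBC Driver 17 for SQL Server};'
        'SERVER=my-sql-host;DATABASE=reports;UID=USER;PWD=PASSWORD'
    )
    cur = conn.cursor()
    cur.execute('TRUNCATE TABLE dbo.agg_report')  # full refresh each run
    cur.fast_executemany = True
    cur.executemany(
        'INSERT INTO dbo.agg_report (report_date, client_id, total) '
        'VALUES (?, ?, ?)',
        [(r['report_date'], r['client_id'], r['total']) for r in rows],
    )
    conn.commit()
    conn.close()


with DAG(
    dag_id='bq_agg_to_sql_server',
    start_date=datetime(2024, 1, 1),
    schedule_interval=timedelta(hours=6),  # "every X hours"
    catchup=False,                         # flip to True to backfill
) as dag:
    PythonOperator(task_id='bq_to_sql_server',
                   python_callable=bq_to_sql_server)
```

Backfills can then reuse the same DAG (e.g. via `airflow dags backfill`), which addresses the duplicate-job concern in option 2, and since each run opens a single SQL Server connection, an Airflow pool or a small `max_active_runs` keeps the connection count bounded.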