Search code examples
google-cloud-platformgoogle-bigqueryetlgoogle-cloud-dataflowgoogle-cloud-sql

Cloud SQL to BigQuery ETL tool


I have a Cloud SQL instance with hundreds of databases, one for each customer. Each database has the same tables in it, but data only for the specific customer.

What I want to do with it, is transform in various ways so to get an overview table with all of the customers. Unfortunately, I cannot seem to find a tool that can iterate over all the databases a Cloud SQL instance has, execute queries and then write that data to BigQuery.

I was really hoping that Dataflow would be the solution but as far as I have tried and looked online, I cannot find a way to make it work. Since I spent a lot of time already on investigating Dataflow, I thought it might be best to ask here.

Currently I am looking at Data Fusion, Datastream, Apache Airflow. Any suggestions?


Solution

  • Why Dataflow doesn't fit your needs? You could run a query to find out the tables, and then iteratively build the Pipeline/JdbcIO sources/PCollections based on those results. Beam has a Flatten transform that can join PCollections.

    What you are trying to do is one of the use cases why Dataflow Flex Templates was created (to have dynamic DAG creation within Dataflow itself) but that can be pulled without Flex Templates as well.