
Cloud Data Fusion ETL from PostGres to BigQuery - idempotent load


I'm trying to use Google's Cloud Data Fusion (CDF) to perform an ETL of some OLTP data from Postgres into BigQuery (BQ). Every night we will copy the contents of a few select tables into equivalent tables in BQ, adding one column with the datestamp.

So imagine we have a table with two columns A & B, and one row of data like this in Postgres:

| A   | B    |
|-----|------|
| egg | milk |

Then over two days, the BigQuery table would look like this:

| ds       | A   | B    |
|----------|-----|------|
| 22-01-01 | egg | milk |
| 22-01-02 | egg | milk |

However, I'm worried that the way I am doing this in CDF is not idempotent: if the pipeline runs twice, I'll have duplicate data for a given day in BQ (not desired).

One idea is to delete the rows for that day in BQ before doing the ETL (as part of the same pipeline). However, I'm not sure how to do this, or whether it is best practice. Any ideas?


Solution

  • You could delete the data in a BigQuery action at the start of the pipeline, though that runs into other issues if people are actively querying the table, or if the delete action succeeds but the rest of the pipeline fails.
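    For the delete-first approach, a BigQuery Execute action at the start of the pipeline could run a statement along these lines (a sketch: the project, dataset, and table names are placeholders, and `${logicalStartTime(yyyy-MM-dd)}` is a CDAP macro that resolves to the pipeline's logical run date at execution time):

    ```sql
    -- Remove any rows already loaded for the run date so a re-run starts clean.
    DELETE FROM `my_project.my_dataset.my_table`
    WHERE ds = '${logicalStartTime(yyyy-MM-dd)}';
    ```

    Note that this only makes a *re-run* safe; as mentioned above, a failure between the delete and the load still leaves the table missing that day's data until the pipeline succeeds.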

    The BigQuery sink can also be configured to upsert data instead of inserting it. This makes the load idempotent as long as your data has a key that uniquely identifies each row.

    Some other possibilities are to place a BigQuery execute after the sink that runs a BigQuery MERGE, or to write a custom Condition plugin that queries BigQuery and only runs the rest of the pipeline if data for the date does not already exist.
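    As a sketch of the MERGE option, the BigQuery Execute step after the sink could run something like the following, assuming the pipeline first lands the nightly extract in a staging table (all table names here are illustrative, and the key columns `(ds, A)` come from the example above):

    ```sql
    -- Upsert the freshly staged rows into the target table, keyed on (ds, A).
    -- Re-running the pipeline produces the same final state, so the load
    -- is idempotent.
    MERGE `my_project.my_dataset.target` T
    USING `my_project.my_dataset.staging` S
    ON T.ds = S.ds AND T.A = S.A
    WHEN MATCHED THEN
      UPDATE SET B = S.B
    WHEN NOT MATCHED THEN
      INSERT (ds, A, B) VALUES (S.ds, S.A, S.B);
    ```

    The advantage over the delete-first approach is that the MERGE is a single atomic statement, so readers never see a half-loaded day.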