Search code examples
sqlazure-databricksdata-warehousecdcbacpac

What is the most efficient way to generate a change data set given two SQL Server backup files?


I need to ingest data from a vendor that only provides full snapshots of their databases rather than pushing the changes or providing direct query / API access.

So the vendor uploads a SQL Server backup file every night. That file gets restored into a local SQL Server. This data is considered "current". In a stored procedure, we compare the tables to the existing (aka "prior") version that was restored the previous day to get inserts, updates, and deletes. We have to do the full compare because the source data has hard deletes.

Is there a cleaner way to do this that leverages native change data capture (CDC) functionality?

For example, is there a way to use these backup files in conjunction with Azure SQL Server's change data feed?

For more context, the eventual target is the raw/bronze zone of our data platform. We are using Azure Databricks end to end (primary compute engine and SQL endpoint).

I can't seem to find any information about this.


Solution

  • Because two different full backups will be restored as two differents databases, the only ways to proceed is to compare all the rows from all the tables.

    You can be help to do so by computing a checksum that will indicate : updates when key is the same and cheksum is different inserts when checksum appear delete when checksum disappear

    But, in anycase, there is a problem if a row has is key value changed...