Tags: azure, azure-data-factory, etl

Azure Data Factory - Initial Load SQL Server then CDC


I have a task where I need to load 2 years of data from a SQL Server table into a data lake using ADF, and then continue ingesting changes using CDC.

I am new to ADF and was wondering how to achieve this, since CDC on the table has not yet been activated. My question is how to move from the initial load to CDC without missing data or re-ingesting the whole table.

Also, with CI/CD, how would the pipeline work for a staging environment: should it be scheduled or not? Should it only work with a subset of data for testing purposes, etc.?

Any guidance would be appreciated!


Solution

  • I will break your question into 3 parts:

    1st part: loading the 2 years of data requires an ADF pipeline that uses a Copy activity to move the data. You just need to configure your source dataset to point to your table and your sink dataset to your Data Lake storage. Since you mentioned a large dataset, consider optimizing the Copy activity with data integration units (DIUs) for parallel processing, and with staging if necessary, to speed up the transfer.

    https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview

    https://learn.microsoft.com/en-us/azure/data-factory/copy-data-tool?tabs=data-factory

    https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance
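A common way to parallelize a large historical load is to slice the 2-year range into smaller date partitions and fan them out with a ForEach over parameterized Copy activities. As a minimal sketch (the column name `modified_at` and the monthly granularity are illustrative assumptions, not part of your schema):

```python
from datetime import date

def month_ranges(start: date, end: date):
    """Split [start, end) into per-month (lo, hi) ranges.

    Each range can drive one parameterized Copy activity run
    (e.g. a source query filtered on lo <= modified_at < hi),
    so a ForEach can fan the initial load out in parallel.
    """
    ranges = []
    lo = date(start.year, start.month, 1)
    while lo < end:
        # first day of the next month
        hi = date(lo.year + (lo.month == 12), lo.month % 12 + 1, 1)
        ranges.append((max(lo, start), min(hi, end)))
        lo = hi
    return ranges

# Two years of history split into 24 monthly slices:
slices = month_ranges(date(2022, 1, 1), date(2024, 1, 1))
```

Half-open ranges (`>= lo`, `< hi`) ensure no row falls into two slices and none is skipped at a month boundary.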

    2nd part: CDC after the initial load

    If CDC is not already enabled on your source database, you will need to enable it; how you do so depends on your database type.

    You can also create another ADF pipeline that uses the CDC feature in combination with the change tracking features of your source database.

    For SQL databases, ADF provides a built-in CDC feature for some sources, or you can implement it manually using watermarks (a column with a sequential identifier such as a timestamp or an auto-incrementing primary key).

    To make sure you don't miss any data between the initial load and the start of CDC, you might need to:

    • Slightly overlap the initial-load and CDC timeframes

    • Use a high watermark from the initial load so that CDC picks up exactly where the initial load left off

    https://learn.microsoft.com/en-us/azure/data-factory/concepts-change-data-capture
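The two bullets above can be sketched with toy data: record the high watermark when the initial load finishes, start the incremental pull slightly before it (the overlap), and dedupe on the primary key so the overlap doesn't produce duplicates. Column names and the 5-minute overlap are illustrative assumptions:

```python
from datetime import datetime, timedelta

# Toy rows standing in for the source table: (pk, modified_at).
rows = [
    (1, datetime(2024, 1, 1, 10, 0)),
    (2, datetime(2024, 1, 1, 11, 0)),
    (3, datetime(2024, 1, 1, 12, 0)),
]

# Initial load: copy everything and record the high watermark.
initial_load = list(rows)
high_watermark = max(ts for _, ts in initial_load)

# A late change arrives after the initial load finished.
rows.append((2, datetime(2024, 1, 1, 12, 30)))

# Incremental pull: start slightly BEFORE the watermark (overlap),
# so rows committed around the cutover are never missed.
overlap = timedelta(minutes=5)
incremental = [(pk, ts) for pk, ts in rows if ts >= high_watermark - overlap]

# Dedupe on primary key, keeping the newest version of each row.
merged = {}
for pk, ts in initial_load + incremental:
    if pk not in merged or ts > merged[pk]:
        merged[pk] = ts
```

The overlap trades a little re-reading for the guarantee that nothing committed around the cutover is lost; the upsert on the key makes the re-read harmless.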

    3rd part: CI/CD and triggers

    For CI/CD, you can manage your code in a source control system (such as Azure Repos or GitHub) and automate deployments using Azure DevOps or GitHub Actions.

    https://learn.microsoft.com/en-us/azure/data-factory/continuous-integration-delivery

    When it comes to moving from development to staging to production, you can use ARM templates for resource deployment across environments, with parameters in your templates to handle environment-specific settings such as connection strings or file paths.
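As a sketch of that parameterization (all names, keys, and account URLs here are hypothetical placeholders, not a fixed ARM schema), the staging environment can carry a "triggers off, test subset" configuration while production runs on schedule:

```python
# Hypothetical per-environment settings; the environment names,
# keys, and storage account names are illustrative only.
environments = {
    "stg": {
        "dataLakeUrl": "https://stgdatalake.dfs.core.windows.net",
        "sqlConnectionSecretName": "sql-conn-stg",
        "triggerEnabled": False,  # staging: run manually on a test subset
    },
    "prod": {
        "dataLakeUrl": "https://proddatalake.dfs.core.windows.net",
        "sqlConnectionSecretName": "sql-conn-prod",
        "triggerEnabled": True,   # production: scheduled trigger on
    },
}

def arm_parameters(env: str) -> dict:
    """Wrap each setting in {"value": ...}, the shape an ARM
    template parameter file expects for its entries."""
    return {name: {"value": value} for name, value in environments[env].items()}
```

Keeping triggers disabled in staging lets you run the same pipeline definition on demand against a subset of data, then flip a single parameter at deployment time for production.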

    You can use triggers to schedule pipeline runs, and if you are looking for continuous processing, consider using tumbling window triggers.

    https://learn.microsoft.com/en-us/azure/data-factory/concepts-pipeline-execution-triggers
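What makes tumbling window triggers a good fit for incremental loads is how they slice time: fixed-size, back-to-back windows with no gaps and no overlap, so each change falls into exactly one run. A minimal sketch of that slicing:

```python
from datetime import datetime, timedelta

def tumbling_windows(start: datetime, interval: timedelta, count: int):
    """Yield (window_start, window_end) pairs the way a tumbling
    window trigger slices time: contiguous, fixed-size, with no
    gaps and no overlap, so every change lands in exactly one run."""
    for i in range(count):
        yield (start + i * interval, start + (i + 1) * interval)

# Three hourly windows starting at midnight:
windows = list(tumbling_windows(datetime(2024, 1, 1), timedelta(hours=1), 3))
```

Each window's start and end can be passed into the pipeline as parameters, which pairs naturally with the watermark-based incremental query described above.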